2004-04-21 03:58:15

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN Realtime & robust mutexes for Linux try 2.2

Hi All

This is a new release of the code that provides a user and kernel
space synchronization infrastructure with real-time friendly
behavior, priority inversion protection (through serialized unlocks,
priority inheritance and priority protection), deadlock detection and
robustness.

It builds upon the design of futexes and their usage model, as NPTL does.

Please look at the first patch, which contains the documentation,
for information on how it is implemented.

High level changelog since release 2.1:

- Ported to 2.6.5

- Priority inheritance rework: it should now be fully functional: it
supports SCHED_NORMAL tasks, holding multiple mutexes (without bugs),
and ownership/wait chains with complex propagation rules, and it plays
fine with the planned priority protection support. This required
somewhat invasive surgery in sched.c; I need to find a way to simplify
it, as it touches some very hot paths.

- Added unserialized mode: in this mode, during unlocks, the mutex is
not kept locked (this is how NPTL works). Performance increases, but
you lose robustness guarantees.

- Greg Weeks, from TimeSYS, provided the ports to ppc and ppc64; I
haven't tested them myself, as I lack such machines.

- Fix nasty bug when a signal is sent to a waiter--still can be
improved.

- Fix nastier bug in changing the priority of a waiter--can't call
fuqueue_waiter_chprio() from inside a task_rq_lock/unlock pair or
bad things might happen during priority inheritance (double spin
lock).

- Updated the consistency() interface a wee bit; it is now called
ctl(), as it does way more than consistency control. Made the
implementation a wee bit cleaner; still very ugly, though.

- Clean up some symbol names for correctness and meaning; wipe out
some debugging stuff--still a lot left around, but will clean
up IANF.

Still to-do:

- Requeue-like support for speeding up conditional variables.

- Jamie's auto unlock-method detection.

- Ugly bug triggered by running rtnptl-test-misc::str03 with:

$ (ulimit -s 32; time ./str03 -b5 -d5 > /tmp/kk)

[triggers inconsistency warning on NPTL patched with the rtnptl
patch--haven't been able to reproduce it lately using the
unserialized mode, but need to try with the serialized one again].

- Finish implementing priority protection; most of the infrastructure
is there already.

- Add a flag to the timeout information passed in to indicate
whether it is relative or absolute and, while we are at it, the
clock source. This way we avoid a costly kernel access in
pthread_mutex_timedlock() and friends. Plan to encapsulate the whole
thing (timespec, rel/abs, clock source) in a struct copied from user
space (see the sketch after this list).

- Wipe out debug stuff

- Research using single-bit mode for the fast path on arches without
compare-and-exchange but with test-and-set bit.

- Call fuqueue_waiter_cancel() from try_to_wake_up()?

- Add CONFIG option to compile out parts or all.

- Research more into safe static allocation as proposed by Scott
Wood--the basic idea has a bunch of problems (mainly that it
kills caching, among others) and needs further research.
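
For the timeout item above, the kind of struct I have in mind would
look something like this (purely hypothetical layout, nothing like it
is in the patch yet):

  struct fusyn_timeout {
          struct timespec ts;     /* the timeout value itself */
          unsigned flags;         /* relative vs absolute */
          clockid_t clock_id;     /* clock source to measure against */
  };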

Did I miss anything?

The patch is split in the following parts:

1/11: documentation files
2/11: modifications to the core
3/11: Support for i386
4/11: Support for ia64
5/11: Support for ppc
6/11: Support for ppc64
7/11: kernel fuqueues
8/11: user space/kernel space tracker
9/11: user space fuqueues
10/11: kernel fulocks
11/11: user space fulocks

We have a site at http://developer.osdl.org/dev/robustmutexes
with references to all the releases, test code and NPTL
modifications (rtnptl) to use this code. As well, the patch
is available there as a single file, in case you don't want to
apply the pieces manually.



2004-04-21 03:59:03

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 1/11: documentation files

fusyn.txt | 679 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 679 insertions(+)

--- /dev/null Thu Apr 15 00:58:25 2004
+++ Documentation/fusyn.txt Tue Feb 3 00:56:52 2004
@@ -0,0 +1,679 @@
+
+FUSYN - Fast User SYNChronization primitives
+
+http://developer.osdl.org/dev/robustmutexes/
+
+I am calling these things FUSYNs to distinguish them from the original
+futexes they are based on, as they behave somewhat differently (the
+naming sucks, you are welcome to suggest new names).
+
+This is my second big-time attempt to implement real-time locking in
+the Linux kernel to solve the shortcomings of a locking system based
+on futexes, which are mainly:
+
+ - no robustness support without kernel cooperation
+
+ - no priority inheritance/protection
+
+ - no real-time wake up (priority based, no priority inversion
+ holes, etc)
+
+ - no way to implement detection of complex deadlock scenarios.
+
+The main objects to implement are:
+
+ - fuqueues: Priority-sorted wait queues -- other than that, mostly
+ equal to futexes. Usable from kernel and user space.
+
+ - fulock:
+
+ This is a full-blown "mutex" implementation that can be used from
+ kernel and user space (with user-space fast [un]locking in
+ non-contended situations), with robustness (if the owner dies,
+ ownership is passed on), priority inheritance and (FUTURE) priority
+ protection.
+
+ A fulock is just a fuqueue that supports the ownership concept, to
+ allow for robustness and priority inheritance/protection.
+
+ It also supports serialized and non-serialized unlocks [see FAQ].
+
+All the non-blocking calls (wake() and unlock()) can be used from
+interrupt context; the data structures are protected by IRQ-safe spin
+locks. This is heavier weight, but it is needed to properly support
+priority changes while waiting and priority inheritance. As well, it
+helps to avoid cache line bouncing of the spin locks that protect the
+structures.
+
+Released files:
+
+http://developer.osdl.org/dev/robustmutexes/
+
+fusyn-KERNEL-VERSION.PATCH-VERSION.patch Patch file against Linus' tree
+fusyn-test-PATCH-VERSION.tar.gz Sample test library and code
+
+
+Contents:
+---------
+
+- quick intro
+- vlocators
+- fuqueues
+- fulocks
+- issues/future work
+- FAQ [some definitions here, try wild search]
+
+
+QUICK INTRO
+-----------
+
+Fuqueues are more or less like waitqueues and like futexes (the user
+space ones are ufuqueues), but they are priority sorted, and if you
+change the priority of a waiter while it is blocked, it will update
+its position in the wait list.
+
+The priority sorting is O(N) on insertion for now, but I will change
+that sometime to be O(1) [actually, O(N) with N bounded by the max
+number of supported priorities, so effectively O(1)].
+
+Fulocks build on top of fuqueues and just add the concept of 'owner'
+plus some flags. A fulock can only be owned by a single task at the
+same time; a task contains a list of the fulocks it owns.
+
+This is for kernel space; for user space, each fu* structure has a
+ufu* counterpart that adds a vlocator--that is used to associate the
+user space address with the kernel space pointer of the ufu*
+struct. ufu* objects are allocated on demand; when no one is using
+them, they are released after a while (so we have caching) ['use'
+means somebody waiting on it to be woken up/acquire the lock, or
+somebody owning the [u]fulock].
+
+To speed things up, there is a fast mode (non-KCO) to lock/unlock
+fulocks in user space without need for kernel intervention (just like
+NPTL does with futexes). To allow for robustness, priority inheritance
+and the like, we need to know who owns the lock, so we lock using the
+PID of the locking thread [more on this below].
+
+
+VLOCATORS
+---------
+
+This is a structure (struct vlocator) and the associated code to map
+a user space address to a kernel object that is kept in a hash
+table. As well, it provides and uses a reference count for the
+object, and a hash table cleanup function.
+
+It uses the 'struct futex_key' as done by Jamie Lokier; the code is in
+include/linux/vlocator.h and kernel/vlocator.c.
+
+Two very simple operations: find an object in 'current's space by
+address and find-or-allocate (vl_find() and vl_locate()).
+
+The cleanup function (or garbage collector) runs periodically and
+releases items with a reference count of zero. This allows the get/put
+operations to be lockless.
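+
+As an illustration only (the real prototypes live in
+include/linux/vlocator.h and may well differ; the argument lists here
+are an assumption), the intended usage is along these lines:
+
+	struct vlocator *vl;
+	unsigned long uaddr;	/* user space address of the object */
+
+	/* find the object for uaddr in 'current's space, taking a
+	 * reference on it... */
+	vl = vl_find(uaddr);
+
+	/* ...or find-or-allocate: hash a new object if none exists */
+	vl = vl_locate(uaddr);
+
+	/* objects whose reference count reaches zero are freed later
+	 * by the periodic garbage collector, so get/put stay lockless */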
+
+
+FUQUEUES
+--------
+
+Fuqueues are just wait-queues, like futexes; the differences are in
+the wake up process, as it is done not in a FIFO order but by
+priority. As well, if the task changes its priority while waiting, its
+position in the list is updated. The code is in include/linux/fuqueue.h
+and kernel/fuqueue.c.
+
+They consist of a 'struct fuqueue' which has a priority-sorted wait
+list and a lock to protect access to it. They can be used in
+kernel-space as a wait queue.
+
+Entry points:
+
+fuqueue_wait() -- wait on a fuqueue
+fuqueue_wake() -- wake N waiters from a fuqueue with code C.
+
+The code is split into various functions to allow fulocks to use the
+fuqueue stuff for integration. The wlist*_t thing is a remnant of
+something that will go away; it stands for 'wait list' and is set up
+with a #define-based redirection to support different types of sorted
+wait list implementations (for example, one that is O(1) using a
+priority array -- that is huge). That is going to be deprecated in
+favor of an O(1) priority-sorted list that is not as big (see FUTURE
+WORK).
+
+'struct ufuqueue' is a fuqueue plus the stuff to link it to a possibly
+shared user space address (a vfuqueue) through a vlocator, so that is
+the real futex equivalent. The code is in kernel/ufuqueue.c and just
+consists of the syscall wrappers that associate the proper ufuqueue
+with the vfuqueue and then call the fuqueue layer.
+
+Need to add more stuff to make fuqueues more of a waitqueue
+equivalent.
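+
+A minimal in-kernel usage sketch, lifted from the quick usage guide in
+include/linux/fuqueue.h (error handling omitted):
+
+	struct fuqueue f = fuqueue_INIT (&f);
+
+	/* waiter side: block on f for at most 4343 jiffies */
+	fuqueue_wait (&f, 4343);
+
+	/* waker side: wake the first five waiters on f with -EPERM */
+	fuqueue_wake (&f, 5, -EPERM);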
+
+
+FULOCKS
+-------
+
+The mother of the whole thing. A fulock is a full mutex
+implementation; it is basically the concept of an owner and a list of
+tasks waiting to own the mutex (implemented with a 'struct fuqueue').
+
+The 'struct fulock' holds all the fulock properties (most in the flags
+member). As well, there is an ownership list node, where all the
+fulocks that a task currently owns are linked to the task
+(task->fulock_olist).
+
+Properties of a fulock:
+
+- owner/state: locked or unlocked
+
+- mode of operation (encoded in the flags):
+
+ + pi xor pp: fulock is priority inheritance or protection (or
+ none).
+
+ + deadlock detection: lock() checks for deadlocks before allowing
+ it.
+
+ + sun-mode robustness: do robustness not-so-flexibly
+
+ + KCO (no fast path) xor non-KCO (fast path allowed).
+
+ Robustness is always enabled as it is easy for user space to
+ simulate the hangs that happen when you don't have it.
+
+- priority:
+
+ + Priority Inheritance: the priority of a fulock is that of the
+ highest priority waiter on its wait list. If no waiters, then it
+ has minimal priority.
+
+ + Priority Protection: the priority of the fulock is its priority
+ ceiling. The priority ceiling is encoded in the flags too.
+
+ + Normal (no PI or PP): the priority is the minimum one (this nops
+ everything in the PI/PP mechanisms).
+
+ This priority is assigned to the ownership list node so that the
+ fulocks in the ownership list are sorted. This is important for
+ priority inheritance and protection.
+
+- health state: healthy, dead-owner or not-recoverable (see the FAQ
+ for the definitions). It is encoded in the flags.
+
+The entry points are [kernel/fulock.c, include/linux/fulock.h]:
+
+fulock_lock()
+fulock_unlock()
+fulock_consistency() [for manipulating the state]
+
+How PI/PP works is by always keeping the priorities of the
+waiters/fulocks around. The fulock has a prio that comes from the wait
+list (if PI), from the prio ceiling (if PP) or is minimal (none). When
+anybody owns that fulock, the fulock is added to the ownership list in
+fulock prio order. The highest prio fulock determines the prio of the
+list, and that is what is called the "boost" priority, which is stored
+in task->boost_prio. I've hacked up the scheduler so that, whenever it
+needs the prio of a task for activating it, it chooses the higher of
+the real prio and the boost one. This way, when the boost is
+higher, the task is effectively boosted (see __prio_boost() in
+sched.c). The hack is still a wee bit hackish; I need to make it more
+optimized so that we don't need to calculate the minimum every time,
+but just once, when we change the boosting thingie.
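+
+A minimal sketch of that selection, mirroring effective_prio() in the
+sched.c part of this patch (remember that a numerically lower value
+means a higher priority, and that an un-boosted task has boost_prio
+set to BOTTOM_PRIO, i.e. INT_MAX, so its normal prio wins):
+
+	static int effective_prio(task_t *p)
+	{
+		int new_prio;
+
+		/* dynamic prio: RT tasks keep theirs, SCHED_NORMAL
+		 * tasks get the interactivity-adjusted one */
+		new_prio = rt_task(p) ? p->prio : __effective_prio(p);
+
+		/* an active boost (lower value) overrides it */
+		return min(new_prio, p->boost_prio);
+	}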
+
+
+A user level fulock (struct ufulock) is a fulock that can be used from
+the user space--it is represented by a (potentially shared) memory
+address (a vfulock) in user space. A vlocator is used to track
+it. Implemented in kernel/ufulock.c.
+
+Now, depending on certain parameters (the arch supporting atomic
+compare-and-exchange, an anal-retentive robustness need and something
+else I can't remember) you might want to use fast-path mode (non-KCO)
+or KCO mode. KCO stands for Kernel Controlled Ownership. In this mode,
+every [un]lock() operation goes through the kernel, without user space
+optimizations for the low contention case.
+
+In non-KCO mode (fast-path), the vfulock may have different values
+that serve to define the state of a lock:
+
+0 Unlocked [can be fast-locked]
+PID (< VFULOCK_WP) Fast-locked by PID, no waiters in the
+ kernel. [can be fast-unlocked].
+VFULOCK_WP Locked by someone, kernel knows who, waiters
+ in the kernel.
+VFULOCK_DEAD Previous owner died (maybe unlocked or
+ locked), the kernel keeps the status [this is
+ effectively identical to KCO mode].
+VFULOCK_NR Not recoverable.
+
+In KCO mode locks, we just keep the health state of the lock in the
+vfulock (to account for the volatility of KCO ufulocks in kernel
+space; if a KCO ufulock is owned, it exists in userspace; if not, it
+will end up being released back to the kmem cache):
+
+VFULOCK_HEALTHY (==VFULOCK_UNLOCKED) Fulock is healthy, normal
+VFULOCK_DEAD Fulock owner died
+VFULOCK_NR Ditto ...
+
+Now, back to non-KCO mode (the complex case :), this is how it works:
+
+When user space goes to lock a ufulock with a fast operation, it
+issues an atomic compare-and-swap on the vfulock of its PID against 0
+(VFULOCK_UNLOCKED); if it succeeds, it is the owner, done; if not, it
+goes to the kernel (sys_ufulock_lock()), which will put it to wait [see
+test/src/include/kernel-lock.h:vfulock_lock() in the fusyn-test
+package] or do the lock() according to the rules for dead fulocks.
+
+Unlock is fairly similar: if the value is VFULOCK_{WP,DEAD}, go to the
+kernel, sys_ufulock_unlock(); if VFULOCK_NR, return an error; if not,
+the value is a PID and we need to do an atomic compare-and-exchange of
+0 (VFULOCK_UNLOCKED) against the PID to unlock [again, check
+vfulock_unlock()].
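+
+A rough sketch of those fast paths, built on the vfulock_acas() helper
+from the asm-*/fulock.h headers in this patch (the function names and
+return conventions here are made up for the illustration; the real
+user space code lives in the fusyn-test package's
+vfulock_lock()/vfulock_unlock()):
+
+	/* Fast lock: CAS our PID over VFULOCK_UNLOCKED.  Returns !0 if
+	 * we got ownership; 0 means the caller has to go through
+	 * sys_ufulock_lock() (contention, dead owner, etc). */
+	static inline int vfulock_fast_lock(volatile unsigned *vfulock,
+					    unsigned pid)
+	{
+		return vfulock_acas(vfulock, VFULOCK_UNLOCKED, pid);
+	}
+
+	/* Fast unlock: only valid while the vfulock still holds our
+	 * PID; WP/DEAD mean the kernel has to do the unlock, NR is an
+	 * error.  Returns !0 if fast-unlocked, 0 if the caller has to
+	 * call sys_ufulock_unlock() (or bail out on VFULOCK_NR). */
+	static inline int vfulock_fast_unlock(volatile unsigned *vfulock,
+					      unsigned pid)
+	{
+		if (*vfulock >= VFULOCK_WP)	/* WP, DEAD or NR */
+			return 0;
+		return vfulock_acas(vfulock, pid, VFULOCK_UNLOCKED);
+	}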
+
+The kernel will always keep the value in the vfulock and the
+corresponding fulock in the 'struct ufulock' in sync [vfulock_sync()
+in kernel/ufulock.c], and will do that every time we enter it through
+one of the fulock system calls (sys_ufulock_{[un]lock,consistency}()).
+
+The kernel will use the PID set by the fast-locker to find out who the
+owner is when it doesn't know about it [afterwards it will be registered
+in the kernel]--check __fulock_id_owner() for ideas on how to avoid
+collisions due to PID reuse.
+
+Once that is done, what is left is a 'fulock' that can be handled by
+the fulock layer.
+
+Now [uv]fulocks support:
+
+ - Real time: the unlock procedure is realtime in the sense that it
+ is O(1) and the next owner is the highest priority one; as well,
+ the fulock (actually, the vfulock) is never unlocked in the
+ meantime, the ownership is transferred instead of unlocking the
+ lock, waking up the first waiter and waiting for it to acquire
+ it. This avoids priority inversions by lower priority threads
+ sneaking in from other processors at the worst time.
+
+ However, this has a cost: the convoy phenomenon. To avoid that,
+ the unlock can be performed in a non-serialized fashion, where
+ the fulock is unlocked and then the new owner-to-be woken up so
+ it contends for it. Check "What are the two kinds of unlock" in
+ the FAQ below.
+
+ - Deadlock checking: complex deadlock scenarios where an
+ ownership/wait chain [see definition in FAQ] is involved are
+ caught if FULOCK_FL_ERROR_CHK is set.
+
+ - Priority change support: when the priority of the waiting task
+ is changed, its position in the list is updated. See below for
+ effects on priority inheritance.
+
+ - Robustness: when a task that is a fulock owner dies and the
+ kernel knew about it (ie: it was registered in the
+ task->fulock_list), then the fulock is made dead-owner, unlocked
+ and the next waiter gets ownership, with a -EOWNERDEAD return
+ code.
+
+ This is always enabled; user space can emulate the
+ hangs+timeouts that would happen if this were not detected.
+
+ If the kernel knew nothing about it (ie: it was fast-locked),
+ then __fulock_id_owner() will fail to map the PID in the vfulock
+ to an existing task; then the current claimer would be
+ considered the owner after marking the fulock dead-owner.
+
+ Note the comments in __fulock_id_owner() for ideas on how to
+ avoid collisions due to PID reuse.
+
+ - Priority protection: when the owner is set, its priority is
+ raised to the priority ceiling of the fulock; when it unlocks,
+ its prio is driven back to what it was before (or if there are
+ any other boosts in effect, whichever is effective).
+
+ - Priority inheritance: when a waiter queues for a fulock that has
+ the FULOCK_FL_PI bit set and its priority is higher than that of
+ the owner, it will boost the owner's priority to its own; this
+ will propagate in an ownership/wait chain (if the owner was
+ waiting on a fulock, etc). As well, priority changes will
+ also be propagated.
+
+ The guts of these have been explained above; I should work on
+ the order of the explanations...
+
+
+FUTURE WORK
+-----------
+
+ - fucond: conditional variables; although they can be implemented
+ in user space + fuqueues, doing it in the kernel helps a lot in
+ atomicity issues (and the performance should be much better).
+
+ We tried doing that (see releases up to 1.12 on the website) and
+ generally it sucked because of the code bloat in the kernel, so
+ we decided to extirpate it.
+
+ - rw lock: only the first locker can do the fast operation; the
+ others go to the kernel to sign up. This way ownership is
+ supported. If a reader dies, nothing happens (after all, it is
+ supposed to be read-only access), but we need to keep track of
+ readers dying so they don't hold writers off. If a writer dies,
+ next locker (reader or writer) gets dead-owner.
+
+ These guys could also get, like this, PI and PP, as they would be
+ very similar to fulocks, but with two waiting lists. One for
+ writers, one for readers, and they allow many ownerships at the
+ same time (when there are readers).
+
+ Maybe different operation modes to favor writers over readers?
+ FIXME, need to explore.
+
+ - Spinlocks: they could be implemented as a trylock() on a fulock
+ for N loops, after which it'd degenerate into a mutex wait. This
+ way they'd automagically support robustness, PI and PP.
+
+ - Barriers: futexes offer enough functionality for implementing
+ them, however wake up should be real-time (priority based). Not a
+ real issue though, as in barriers everybody is woken up. It can be
+ done also with fuqueues.
+
+ - Getting rid of the vlocator hash table and doing direct mapping
+ [so that we avoid the O(N) lookup] by storing in user space some
+ sort of pointer to an in-kernel data struct. The pointer has to be
+ "validated", so that user space cannot have the kernel point to
+ some random or potentially dangerous space.
+
+ A way would be to store two values, the pointer itself plus a
+ kernel-encrypted copy that can be used to verify it.
+
+ Need more research into this.
+
+ - O(1) priority list: the current plist is not O(1) on insertion,
+ because it has to locate the proper position in the list at which
+ to add. I plan to modify the plist code to be O(N), where N is the
+ number of priority levels, and as that is fixed at compilation
+ time, it is effectively O(1).
+
+ The idea is to have something similar to a priority array, but
+ instead of having N list heads, we have only the first node of
+ each priority being the list head, and the rest of the guys in
+ that prio hanging from it.
+
+ - Sun-mode robustness. Solaris implements robustness in a slightly
+ more restrictive way. We want to add a small compatibility layer
+ so both models can be used.
+
+ - That page pinning when waiting for a fulock...
+
+
+FAQ
+---
+
+This set of Q&A is what I use myself to track my ideas and concepts
+(and to not forget why I decided things the way I did).
+
+
+Q: What is PI?
+
+Priority Inheritance: when task A holds resource R and task B claims
+it, and prio (B) > prio (A), then B can force A to take its priority
+so that A finishes sooner and B can take ownership of the resource. The
+priority boost ends when A releases R.
+
+
+Q: What is PP?
+
+Priority Protection: resources have an associated priority ceiling;
+any task that acquires a resource will have its prio raised to that
+priority ceiling while holding it.
+
+
+Q: What is RM?
+
+Robust Mutex, or robustness, for short: when the owner of a resource
+dies, we want the next owner to know that somebody died while holding
+the resource, so s/he is able to determine if a cleanup is needed.
+
+
+Q: What is a healthy fulock?
+
+This is a fulock in its normal state, that is: initialized and not in
+dead-owner or not-recoverable states.
+
+
+Q: What is a dead-owner fulock?
+
+A fulock is in dead-owner state when it was locked (some task owned
+it) and the task died without unlocking it.
+
+
+Q: What is a not-recoverable fulock?
+
+A fulock is in not-recoverable state when it went into dead-owner
+state and some task that acquired it in dead-owner state decided that
+it had to be made not-recoverable.
+
+The rationale behind this is that normally you have some lock
+protecting access to some data. When the lock goes dead-owner, the
+task that owned it and died could have died in the middle of updating
+the data, and thus it can be inconsistent. Subsequent owners of the
+lock get it with the dead-owner state, so that they are aware of the
+situation. If any of them can fix it, it can move the lock back to
+healthy state and continue operating, but if there is no way to fix
+the data, it is moved to not-recoverable state.
+
+When moved, all the pending waiters are given an error code
+(ENOTRECOVERABLE) indicating the new state, so that they can bail out
+and report up to their managers for what to do. As well, new
+contenders that try to acquire the lock will also get the EBADR error
+code.
+
+The only way to make the fulock healthy again is to reinitialize it.
+
+
+Q: What is a dead-owner dead-lock?
+
+When some task that has to unlock a locked fulock dies and others are
+waiting for it to release the fulock.
+
+
+Q: What is a dead-owner recovery?
+
+When a lock owner dies, the next waiter or next guy who locks and gets
+ownership gets it with a special code that indicates that some
+previous owner died, that the state of the lock is "dead-owner", and
+that recovery of the data structures protected by the lock must be
+done in order to ensure consistency.
+
+Once a fulock is in dead-owner state, it can be moved back to
+normal/healthy or made inconsistent (so only an initialization returns
+it to normal).
+
+
+Q: Why does the kernel have to set the value of the fulock?
+ Why cannot the value of the fulock after unlock be set by user
+ space?
+
+This applies only to non-KCO (fast-path) mode fulocks.
+
+There is a risk of overwritten values and missed waiters.
+
+For example, task B claims fulock F (locked by task A) so it goes to
+the kernel to wait; now the fulock value is VFULOCK_WP (waiters
+blocked in the kernel). Before it reaches the kernel, task C releases
+the fulock for task A; as there are no waiters, it returns UNLOCKED
+and task C has to set it to UNLOCKED, thus overwriting VFULOCK_WP; as
+_WP is overwritten, task B is going to be dead-locked in the kernel,
+waiting.
+
+Furthermore, it helps guarantee robustness. If the just-woken-up
+task that has to set the value the kernel passed dies before being
+able to do it, you hit a dead-owner dead-lock because nobody wakes up
+the waiters until somebody tries to lock and realizes the fulock is
+dead.
+
+
+Q: What are the two kinds of unlock?
+
+The two kinds of unlock are serialized and non-serialized. Each one is
+explained in more detail in the next two Qs.
+
+You need both because the serialized one can be slower, as it might
+force a context switch.
+
+I thought initially that this would show only in synthetic benchmarks
+(very tight loop acquiring and releasing the lock against some other
+threads doing the same thing), but I was wrong. Max Hailperin pointed
+out to me that what I was seeing was the "Convoy Phenomenon", documented
+by Mike Blasgen, Jim Gray, Mike Mitoma and Tom Price, in 1977
+[http://portal.acm.org/citation.cfm?id=850659&jmp=cit&dl=GUIDE&dl=ACM]
+when I was still picking my nose and sucking my thumb.
+
+After thinking about it, I concluded it would mostly apply in the
+following conditions:
+
+- user space only [in kernel space the lock itself is the spinlock
+ that protects the 'struct fulock'; we spin and disable preemption, so
+ there is no context switch].
+
+- non real-time environments/processes (where preemption is likely)
+
+- real-time SMP environments with non-KCO fulocks, where two tasks
+ might compete for a lock at the _same_ time (so to avoid it in this
+ case, it might be interesting to spin a wee bit in user space before
+ blocking).
+
+Now, in order to gain robustness you need serialization (*), so a
+userspace user is recommended to use serialized wake-ups only when it:
+
+- needs a *full* and complete robustness guarantee
+
+- needs a real-time priority-inversion protection guarantee (in SMP; not
+ needed for UP).
+
+(*) See the next Q, but summarizing: Task A holds M, tasks B and C are
+waiting; A unlocks(), non-serialized, M is unlocked (in the kernel,
+the vfulock is still VFULOCK_WP), and B is woken up. B goes back to
+user space and contends, sees VFULOCK_WP and goes to the kernel to
+lock, but before it gets there, it dies. Now C is stuck there because
+the fulock is unlocked and nobody knew it was waiting.
+
+A way to solve this could be to mark task B and fulock M to record
+that B should lock M. If on the way to death (do_exit()) we see that,
+we mark M as dead and initiate recovery in the next waiter. FIXME:
+need to research.
+
+
+Q: What is a non-serialized unlock?
+
+A non-serialized unlock works by setting the fulock to unlocked and
+waking up as many waiters as desired. The waiters then re-contend for
+the fulock, the winner owns it and the others go back to wait on it.
+
+This approach is not as heavy in forcing context switches, and thus
+can yield better performance, avoiding the convoy phenomenon.
+
+However, the drawbacks are that you lose priority-inversion protection
+and the ability to guarantee robustness.
+
+- Say we have a fulock with M guys waiting and we wake up N (1 < N <
+ M), a non-serialized wakeup. Thus, there are M - N guys still
+ waiting in the kernel.
+
+ In order for the unlock to be non-serialized, the waker first sets
+ the lock to UNLOCKED.
+
+ Now, how do the woken up processes know that there are still
+ processes in the kernel?
+
+ A solution is to set the vfulock not to UNLOCKED, but to _WP; this
+ way, whoever tries to acquire it will see that and go down to the
+ kernel to do the lock operation.
+
+ However, it still does not solve the fact that when setting to _WP
+ and waking N, if those N die before locking, the waiters go into
+ dead-owner dead-lock.
+
+ When somebody tries to lock that, the kernel should be able to
+ notice that there are waiters and it is unlocked and thus give
+ way to the first locker with dead-owner recovery--but it might be too late.
+
+ Another option might be to tag the woken-up processes before they
+ exit the kernel, so that if they die, do_exit can trap it (but there
+ are many side issues to this, like how do I make sure that the N who
+ I woke up have gone through it, one has locked, the other N-1 have
+ gone to sleep, how do I clean it up and stuff like that--makes it
+ pretty ugly, not to mention how many resources it'd need to tag it).
+
+ Gosh, it is a mess -- I would say that robust mutexes have to
+ require serialized unlocks. Period.
+
+ Let's not even talk about adding a short timer to verify that the
+ thing was locked... shivers.
+
+ RESEARCH: tentative ownership: when we wake up some guys who are
+ supposed to go and try to lock again, tie the fulock they should
+ lock to their task_struct and on exit(), check they don't have it
+ there ... [many details need to be worked out].
+
+- It could also be solved by waking up _all_ the waiters; this way no
+ dead-owner dead-lock could ever happen; however, it is a sure recipe
+ for a scheduling storm some day.
+
+
+Q: What is a serialized unlock?
+
+A serialized unlock transfers the ownership of the fulock to the first
+waiter in the kernel.
+
+- Only one waiter can be woken up at the same time with this method.
+
+- It prevents priority inversion (as the fulock stays locked during
+ the whole operation, no low priority thread can acquire it in the
+ meantime).
+
+- dead-owner dead-lock is not possible, because the owner is always
+ known during the operation. As well, if the new owner dies on its
+ way up to user space, its ownership is also known.
+
+It is slower (still not proven rigorously--postulated and observed in
+some very synthetic benchmarks) because it forces a context switch.
+
+
+Q: What is a vfulock?
+
+It is the address in user space associated with a fulock in kernel
+space.
+
+
+Q: What is owner identification?
+
+Owner identification is a property that the KCO ufulocks have:
+basically, they can identify who is the owner based on the vfulock (in
+user space) or the internal kernel data structures that refer to it
+(if the vfulock is VFULOCK_KCO, that means that the kernel tracks the
+ownership); if vfulock < VFULOCK_KCO, it means that the ownership is
+tracked only in user space, and the vfulock is the PID of the owner.
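+
+A sketch of that decision, following the rules just stated (the helper
+name and the 'owner' field are illustrative, and locking against PID
+reuse/task exit is omitted; the real code is __fulock_id_owner()):
+
+	static struct task_struct *owner_of(unsigned vfulock_value,
+					    struct fulock *fulock)
+	{
+		if (vfulock_value < VFULOCK_KCO)
+			/* fast-locked: the value is the owner's PID */
+			return find_task_by_pid(vfulock_value);
+		/* VFULOCK_KCO and above: the kernel tracks ownership */
+		return fulock->owner;
+	}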
+
+
+Q: What is a kernel-controlled-ownership fulock? (KCO)
+
+A fulock that has no fast-path; every operation is done through the
+kernel. This happens when:
+
+ - The fulock is locked and there are waiters in the kernel
+ - The fulock is dead (and the kernel keeps track of the ownership)
+ - The fulock is a priority protected fulock or called with
+ FULOCK_FL_CO in the flags.
+
+Basically it is a way to indicate that the fastpath for
+locking/unlocking cannot be taken.
+
+
+Q: What is an ownership/wait chain?
+
+The structure that is formed when task A owns lock F and is waiting
+for lock G, owned by task B that is waiting for lock H, that is owned
+by task C that is waiting for lock I ... etc.
+
+When this chain is circular (eg: lock I is owned by A) then there is a
+deadlock. Priority protection propagates through these chains, as well
+as priority changes in any part of the chain.

2004-04-21 04:01:45

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 4/11: Support for ia64

arch/ia64/kernel/entry.S | 10 +++---
include/asm-ia64/fulock.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++
include/asm-ia64/unistd.h | 8 ++++
3 files changed, 87 insertions(+), 6 deletions(-)

--- include/asm-ia64/unistd.h:1.1.1.7 Tue Apr 6 00:22:52 2004
+++ include/asm-ia64/unistd.h Tue Apr 6 02:16:25 2004
@@ -248,13 +248,19 @@
#define __NR_clock_nanosleep 1256
#define __NR_fstatfs64 1257
#define __NR_statfs64 1258
+/* Hole: 1259 */
#define __NR_reserved1 1259 /* reserved for NUMA interface */
#define __NR_reserved2 1260 /* reserved for NUMA interface */
#define __NR_reserved3 1261 /* reserved for NUMA interface */
+#define __NR_ufulock_lock 1262
+#define __NR_ufulock_unlock 1263
+#define __NR_ufulock_ctl 1264
+#define __NR_ufuqueue_wait 1265
+#define __NR_ufuqueue_wake 1266

#ifdef __KERNEL__

-#define NR_syscalls 256 /* length of syscall table */
+#define NR_syscalls 265 /* length of syscall table */

#if !defined(__ASSEMBLY__) && !defined(ASSEMBLER)

--- arch/ia64/kernel/entry.S:1.1.1.8 Tue Apr 6 00:21:45 2004
+++ arch/ia64/kernel/entry.S Tue Apr 6 02:15:29 2004
@@ -1502,11 +1502,11 @@
data8 sys_fstatfs64
data8 sys_statfs64
data8 sys_ni_syscall
- data8 sys_ni_syscall // 1260
- data8 sys_ni_syscall
- data8 sys_ni_syscall
- data8 sys_ni_syscall
- data8 sys_ni_syscall
+ data8 sys_ufulock_lock // 1260
+ data8 sys_ufulock_unlock
+ data8 sys_ufulock_ctl
+ data8 sys_ufuqueue_wait
+ data8 sys_ufuqueue_wake
data8 sys_ni_syscall // 1265
data8 sys_ni_syscall
data8 sys_ni_syscall
--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/asm-ia64/fulock.h Mon Feb 2 23:07:33 2004
@@ -0,0 +1,75 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * David P. Howell <[email protected]>
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_ia64_fulock_h__
+#define __asm_ia64_fulock_h__
+
+ /* fulock value / state; anything that is not this is a PID that
+ * currently owns the fulock. */
+
+enum vfulock {
+ VFULOCK_UNLOCKED = 0x00000000, /* Unlocked */
+ VFULOCK_HEALTHY = VFULOCK_UNLOCKED, /* KCO mode: the lock is healthy */
+ VFULOCK_WP = 0xfffffffd, /* Waiters blocked in the kernel */
+ VFULOCK_DEAD = 0xfffffffe, /* dead, kernel controls ownership */
+ VFULOCK_NR = 0xffffffff /* fulock is not-recoverable */
+};
+
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+ unsigned old_value, unsigned new_value)
+{
+ unsigned retval;
+
+ /* The following should be the expansion of the cmpxchg_acq() */
+ /* macro from intrinsics.h. Needed this due to glibc builds */
+ /* issues with including asm/types.h. */
+ asm volatile ("mov ar.ccv=%0;;" :: "rO"(old_value));
+ asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv"
+ : "=r"(retval)
+ : "r"(value),
+ "r"(new_value)
+ : "memory");
+ return retval == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock: Pointer to the address of the ufulock to contain for.
+ * @value: New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+ *vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_ia64_fulock_h__ */

2004-04-21 04:01:06

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 2/11: modifications to the core

These are modifications needed to support fuqueues and
fulocks. They are, in a nutshell:

- Move some stuff from kernel/futex.c to futex.h to make it
usable by more people.

- Add necessary fields to the task_struct.

- Hook up to initialization and cleanup points for process
creation and destruction.

- Hook into the signalling and timeout code for wait
cancellation, as well as for changing task priorities.

- Modify bits of the priority recalculation code so we have
the concept of the priority boost.

- Add error codes for dead-owner and not-recoverable.


include/asm-generic/errno-base.h | 4 +
include/linux/futex.h | 56 ++++++++++++++++++++
include/linux/init_task.h | 6 ++
include/linux/sched.h | 14 ++++-
kernel/Makefile | 3 -
kernel/exit.c | 3 +
kernel/fork.c | 2
kernel/futex.c | 39 +-------------
kernel/sched.c | 108 ++++++++++++++++++++++++++++++++++-----
kernel/signal.c | 16 +++++
kernel/timer.c | 6 +-
11 files changed, 207 insertions(+), 50 deletions(-)

--- include/linux/futex.h:1.1.1.2 Tue Apr 6 00:23:00 2004
+++ include/linux/futex.h Sun Nov 16 07:21:32 2003
@@ -9,8 +9,64 @@
#define FUTEX_FD (2)
#define FUTEX_REQUEUE (3)

+#ifdef __KERNEL__
+
+#include <linux/jhash.h>
+
+struct timespec;
+asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
+ struct timespec __user *utime, u32 __user *uaddr2);
+

long do_futex(unsigned long uaddr, int op, int val,
unsigned long timeout, unsigned long uaddr2, int val2);

+/*
+ * Futexes are matched on equal values of this key.
+ * The key type depends on whether it's a shared or private mapping.
+ * Don't rearrange members without looking at futex_hash_key().
+ *
+ * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
+ * We set bit 0 to indicate if it's an inode-based key.
+ */
+union futex_key {
+ struct {
+ unsigned long pgoff;
+ struct inode *inode;
+ int offset;
+ } shared;
+ struct {
+ unsigned long uaddr;
+ struct mm_struct *mm;
+ int offset;
+ } private;
+ struct {
+ unsigned long word;
+ void *ptr;
+ int offset;
+ } both;
+};
+
+static inline
+u32 futex_hash_key (const union futex_key *key)
+{
+ u32 hash = jhash2((u32*)&key->both.word,
+ (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
+ key->both.offset);
+ return hash;
+}
+
+static inline
+int match_futex_key (const union futex_key *key1, const union futex_key *key2)
+{
+ return (key1->both.word == key2->both.word
+ && key1->both.ptr == key2->both.ptr
+ && key1->both.offset == key2->both.offset);
+}
+
+extern int get_futex_key (unsigned long uaddr, union futex_key *key);
+extern void get_key_refs(union futex_key *key);
+extern void drop_key_refs(union futex_key *key);
+
+#endif /* #ifdef __KERNEL__ */
#endif
--- include/linux/init_task.h:1.1.1.5 Tue Apr 6 01:51:36 2004
+++ include/linux/init_task.h Tue Apr 6 02:16:35 2004
@@ -72,6 +72,7 @@
.lock_depth = -1, \
.prio = MAX_PRIO-20, \
.static_prio = MAX_PRIO-20, \
+ .boost_prio = BOTTOM_PRIO, \
.policy = SCHED_NORMAL, \
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
@@ -112,6 +113,11 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .fuqueue_wait_lock = SPIN_LOCK_UNLOCKED, \
+ .fuqueue_wait = NULL, \
+ .fuqueue_waiter = NULL, \
+ .fulock_olist = olist_INIT(&tsk.fulock_olist), \
+ .fulock_olist_lock = SPIN_LOCK_UNLOCKED, \
}


--- include/linux/sched.h:1.1.1.14 Tue Apr 6 01:51:36 2004
+++ include/linux/sched.h Tue Apr 6 02:16:35 2004
@@ -29,6 +29,7 @@
#include <linux/completion.h>
#include <linux/pid.h>
#include <linux/percpu.h>
+#include <linux/fulock-olist.h> /* olist_t */

struct exec_domain;

@@ -286,6 +287,7 @@
#define MAX_RT_PRIO MAX_USER_RT_PRIO

#define MAX_PRIO (MAX_RT_PRIO + 40)
+#define BOTTOM_PRIO INT_MAX

#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

@@ -330,6 +332,8 @@
};


+struct fuqueue;
+struct fuqueue_waiter;
struct io_context; /* See blkdev.h */
void exit_io_context(void);

@@ -369,7 +373,7 @@

int lock_depth; /* Lock depth */

- int prio, static_prio;
+ int prio, static_prio, boost_prio;
struct list_head run_list;
prio_array_t *array;

@@ -493,6 +497,11 @@

unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+ struct fuqueue *fuqueue_wait; /* waiting for this queue */
+ struct fuqueue_waiter *fuqueue_waiter; /* waiting for this queue */
+ spinlock_t fuqueue_wait_lock; /* FIXME: locking too heavy -- better solution? */
+ olist_t fulock_olist; /* Fulock ownership list */
+ spinlock_t fulock_olist_lock;
};

static inline pid_t process_group(struct task_struct *tsk)
@@ -557,6 +566,9 @@
extern int task_curr(task_t *p);
extern int idle_cpu(int cpu);

+/* Set the boost priority */
+extern unsigned __prio_boost (task_t *, int);
+
void yield(void);

/*
--- kernel/Makefile:1.1.1.10 Tue Apr 6 01:51:37 2004
+++ kernel/Makefile Tue Apr 6 02:16:42 2004
@@ -7,7 +7,8 @@
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
- kthread.o
+ kthread.o \
+ fuqueue.o fulock.o vlocator.o ufuqueue.o ufulock.o

obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
--- kernel/exit.c:1.1.1.11 Tue Apr 6 01:51:37 2004
+++ kernel/exit.c Tue Apr 6 02:16:42 2004
@@ -743,6 +743,8 @@

}

+extern void exit_fulocks(struct task_struct *);
+
asmlinkage NORET_TYPE void do_exit(long code)
{
struct task_struct *tsk = current;
@@ -771,6 +773,7 @@
}

acct_process(code);
+ exit_fulocks(tsk);
__exit_mm(tsk);

exit_sem(tsk);
--- kernel/fork.c:1.1.1.14 Tue Apr 6 01:51:37 2004
+++ kernel/fork.c Tue Apr 6 02:16:42 2004
@@ -41,6 +41,7 @@

extern int copy_semundo(unsigned long clone_flags, struct task_struct *tsk);
extern void exit_sem(struct task_struct *tsk);
+extern void init_fulock (struct task_struct *task);

/* The idle threads do not count..
* Protected by write_lock_irq(&tasklist_lock)
@@ -964,6 +965,7 @@
goto bad_fork_cleanup_signal;
if ((retval = copy_namespace(clone_flags, p)))
goto bad_fork_cleanup_mm;
+ init_fulock(p);
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
--- kernel/futex.c:1.1.1.7 Tue Apr 6 00:23:05 2004
+++ kernel/futex.c Tue Apr 6 02:16:42 2004
@@ -41,31 +41,6 @@

#define FUTEX_HASHBITS 8

-/*
- * Futexes are matched on equal values of this key.
- * The key type depends on whether it's a shared or private mapping.
- * Don't rearrange members without looking at hash_futex().
- *
- * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
- */
-union futex_key {
- struct {
- unsigned long pgoff;
- struct inode *inode;
- int offset;
- } shared;
- struct {
- unsigned long uaddr;
- struct mm_struct *mm;
- int offset;
- } private;
- struct {
- unsigned long word;
- void *ptr;
- int offset;
- } both;
-};

/*
* We use this hashed waitqueue instead of a normal wait_queue_t, so
@@ -109,9 +84,7 @@
*/
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
- u32 hash = jhash2((u32*)&key->both.word,
- (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
- key->both.offset);
+ u32 hash = futex_hash_key (key);
return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
}

@@ -120,9 +93,7 @@
*/
static inline int match_futex(union futex_key *key1, union futex_key *key2)
{
- return (key1->both.word == key2->both.word
- && key1->both.ptr == key2->both.ptr
- && key1->both.offset == key2->both.offset);
+ return match_futex_key (key1, key2);
}

/*
@@ -137,7 +108,7 @@
*
* Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/
-static int get_futex_key(unsigned long uaddr, union futex_key *key)
+int get_futex_key(unsigned long uaddr, union futex_key *key)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
@@ -232,7 +203,7 @@
* NOTE: mmap_sem MUST be held between get_futex_key() and calling this
* function, if it is called at all. mmap_sem keeps key->shared.inode valid.
*/
-static inline void get_key_refs(union futex_key *key)
+inline void get_key_refs(union futex_key *key)
{
if (key->both.ptr != 0) {
if (key->both.offset & 1)
@@ -246,7 +217,7 @@
* Drop a reference to the resource addressed by a key.
* The hash bucket spinlock must not be held.
*/
-static void drop_key_refs(union futex_key *key)
+inline void drop_key_refs(union futex_key *key)
{
if (key->both.ptr != 0) {
if (key->both.offset & 1)
--- kernel/sched.c:1.1.1.18 Tue Apr 6 01:51:37 2004
+++ kernel/sched.c Wed Apr 14 01:51:51 2004
@@ -33,6 +33,7 @@
#include <linux/suspend.h>
#include <linux/blkdev.h>
#include <linux/delay.h>
+#include <linux/fuqueue.h>
#include <linux/smp.h>
#include <linux/timer.h>
#include <linux/rcupdate.h>
@@ -356,13 +357,10 @@
*
* Both properties are important to certain workloads.
*/
-static int effective_prio(task_t *p)
+static inline int __effective_prio(task_t *p)
{
int bonus, prio;

- if (rt_task(p))
- return p->prio;
-
bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

prio = p->static_prio - bonus;
@@ -372,6 +370,13 @@
prio = MAX_PRIO-1;
return prio;
}
+static int effective_prio(task_t *p)
+{
+ int new_prio;
+ new_prio = rt_task(p)? p->prio : __effective_prio(p);
+ return min (new_prio, p->boost_prio);
+}
+

/*
* __activate_task - move a task to the runqueue.
@@ -653,7 +658,8 @@
int success = 0;
long old_state;
runqueue_t *rq;
-
+
+
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
@@ -733,6 +739,8 @@
*/
p->thread_info->preempt_count = 1;
#endif
+ /* Initially the task has no priority boosting */
+ p->boost_prio = BOTTOM_PRIO;
/*
* Share the timeslice between parent and child, thus the
* total amount of pending timeslices in the system doesn't change,
@@ -1772,6 +1780,9 @@
* There are circumstances in which we can try to wake a task which has already
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
+ *
+ * fuqueue_wait_cancel needs to hook up here to properly reschedule
+ * priority inheritance/protected tasks. Check its doc to learn why.
*/
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
int nr_exclusive, int sync)
@@ -1783,6 +1794,8 @@
unsigned flags;
curr = list_entry(tmp, wait_queue_t, task_list);
flags = curr->flags;
+ if (unlikely(curr->task->fuqueue_wait != NULL))
+ fuqueue_waiter_cancel(curr->task, -EINTR);
if (curr->func(curr, mode, sync) &&
(flags & WQ_FLAG_EXCLUSIVE) &&
!--nr_exclusive)
@@ -1965,12 +1978,17 @@

void scheduling_functions_end_here(void) { }

+/*
+ * Note the initialization of old_prio and new_dynamic_prio. If we
+ * fall back through 'out_unlock', they will help to skip the call to
+ * fuqueue_waiter_chprio().
+ */
void set_user_nice(task_t *p, long nice)
{
unsigned long flags;
prio_array_t *array;
runqueue_t *rq;
- int old_prio, new_prio, delta;
+ int old_prio = p->prio, new_prio, delta;

if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
return;
@@ -1993,11 +2011,12 @@
if (array)
dequeue_task(p, array);

- old_prio = p->prio;
new_prio = NICE_TO_PRIO(nice);
delta = new_prio - old_prio;
p->static_prio = NICE_TO_PRIO(nice);
p->prio += delta;
+ old_prio = p->prio;
+ p->prio = min (p->prio, p->boost_prio);

if (array) {
enqueue_task(p, array);
@@ -2010,6 +2029,8 @@
}
out_unlock:
task_rq_unlock(rq, &flags);
+ if (old_prio != p->prio && p->fuqueue_wait != NULL)
+ fuqueue_waiter_chprio(p);
}

EXPORT_SYMBOL(set_user_nice);
@@ -2102,20 +2123,79 @@
return pid ? find_task_by_pid(pid) : current;
}

+
+/**
+ * Boost the priority of a task from a new dynamic priority.
+ *
+ * On SCHED_NORMAL, sets boost_prio in as __effective_prio()
+ * would do to get the same prio when it is reinserted in the list.
+ *
+ * @p: Pointer to the task in question
+ * @prio: New boost priority to set
+ * @returns: !0 if the final new dynamic priority of the task has
+ * changed, 0 otherwise.
+ *
+ * This does not do fuqueue priority propagation (to avoid infinite
+ * recursion in the fuqueue code).
+ */
+unsigned __prio_boost(task_t *p, int prio)
+{
+ runqueue_t *rq;
+ prio_array_t *array;
+ long flags;
+ int old_prio, new_dynamic_prio, newprio;
+
+ if (p->boost_prio == prio)
+ return 0;
+
+ rq = task_rq_lock(p, &flags);
+ old_prio = p->prio;
+ array = p->array;
+ if (array)
+ deactivate_task(p, task_rq(p));
+ p->boost_prio = prio;
+ new_dynamic_prio = p->policy != SCHED_NORMAL?
+ MAX_USER_RT_PRIO - 1 - p->rt_priority
+ : __effective_prio(p);
+ newprio = min (new_dynamic_prio, p->boost_prio);
+ p->prio = newprio;
+ if (array) {
+ __activate_task(p, task_rq(p));
+ if (rq->curr == p) {
+ if (p->prio > old_prio)
+ resched_task(rq->curr);
+ }
+ else if (TASK_PREEMPTS_CURR (p, rq))
+ resched_task(rq->curr);
+ }
+ task_rq_unlock (rq, &flags);
+ return old_prio != newprio;
+}
+
+
/* Actually do priority change: must hold rq lock. */
static void __setscheduler(struct task_struct *p, int policy, int prio)
{
+ int newprio, oldprio;
+
BUG_ON(p->array);
p->policy = policy;
p->rt_priority = prio;
if (policy != SCHED_NORMAL)
- p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
+ newprio = MAX_USER_RT_PRIO-1 - p->rt_priority;
else
- p->prio = p->static_prio;
+ newprio = p->static_prio;
+ oldprio = p->prio;
+ p->prio = min (newprio, p->boost_prio);
}

/*
* setscheduler - change the scheduling policy and/or RT priority of a thread.
+ *
+ * Note the initialization of old_prio. If we fall back through
+ * 'out_unlock*', it will help to skip the call to
+ * fuqueue_waiter_chprio(); this way we avoid the extra check on
+ * 'retval == 0'.
*/
static int setscheduler(pid_t pid, int policy, struct sched_param __user *param)
{
@@ -2150,7 +2230,7 @@
* runqueue lock must be held.
*/
rq = task_rq_lock(p, &flags);
-
+ oldprio = p->prio;
if (policy < 0)
policy = p->policy;
else {
@@ -2186,7 +2266,6 @@
if (array)
deactivate_task(p, task_rq(p));
retval = 0;
- oldprio = p->prio;
__setscheduler(p, policy, lp.sched_priority);
if (array) {
__activate_task(p, task_rq(p));
@@ -2204,9 +2283,10 @@

out_unlock:
task_rq_unlock(rq, &flags);
+ if (oldprio != p->prio && p->fuqueue_wait != NULL)
+ fuqueue_waiter_chprio(p);
out_unlock_tasklist:
read_unlock_irq(&tasklist_lock);
-
out_nounlock:
return retval;
}
@@ -2868,6 +2948,7 @@
struct task_struct *p;
struct runqueue *rq;
unsigned long flags;
+ unsigned old_prio;

switch (action) {
case CPU_UP_PREPARE:
@@ -2877,8 +2958,11 @@
kthread_bind(p, cpu);
/* Must be high prio: stop_machine expects to yield to it. */
rq = task_rq_lock(p, &flags);
+ old_prio = p->prio;
__setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1);
task_rq_unlock(rq, &flags);
+ if (old_prio != p->prio && p->fuqueue_wait != NULL)
+ fuqueue_waiter_chprio (p);
cpu_rq(cpu)->migration_thread = p;
break;
case CPU_ONLINE:
--- kernel/signal.c:1.1.1.12 Tue Apr 6 01:51:37 2004
+++ kernel/signal.c Wed Apr 14 20:21:24 2004
@@ -521,6 +521,8 @@
return signr;
}

+extern void fuqueue_waiter_cancel (struct task_struct *, int);
+
/*
* Tell a process that it has a new active signal..
*
@@ -544,12 +546,21 @@
* executing another processor and just now entering stopped state.
* By calling wake_up_process any time resume is set, we ensure
* the process will wake up and handle its stop or death signal.
+ *
+ * fuqueue_waiter_cancel needs to hook up here to properly reschedule
+ * priority inheritance/protected tasks. The reason is that
+ * when we resched a process that has boosted another one, we
+ * need to kick its butt off the CPU (and lower its priority) ASAP
+ * so that 't' can run.
*/
mask = TASK_INTERRUPTIBLE;
if (resume)
mask |= TASK_STOPPED;
- if (!wake_up_state(t, mask))
+ if (unlikely (t->fuqueue_wait != NULL))
+ fuqueue_waiter_cancel(t, -EINTR);
+ if (!wake_up_state(t, mask)) {
kick_process(t);
+ }
}

/*
@@ -672,6 +683,9 @@
set_tsk_thread_flag(t, TIF_SIGPENDING);
state |= TASK_INTERRUPTIBLE;
}
+ /* FIXME: I am not that sure we need to cancel here */
+ if (unlikely(t->fuqueue_wait != NULL))
+ fuqueue_waiter_cancel(t, -EINTR);
wake_up_state(t, state);

t = next_thread(t);
--- kernel/timer.c:1.1.1.13 Tue Apr 6 01:51:37 2004
+++ kernel/timer.c Tue Apr 6 02:16:42 2004
@@ -31,6 +31,7 @@
#include <linux/time.h>
#include <linux/jiffies.h>
#include <linux/cpu.h>
+#include <linux/fuqueue.h>

#include <asm/uaccess.h>
#include <asm/div64.h>
@@ -967,7 +968,10 @@

static void process_timeout(unsigned long __data)
{
- wake_up_process((task_t *)__data);
+ struct task_struct *task = (task_t *) __data;
+ if (unlikely(task->fuqueue_wait != NULL))
+ fuqueue_waiter_cancel(task, -ETIMEDOUT);
+ wake_up_process(task);
}

/**
--- include/asm-generic/errno-base.h:1.1.1.1 Thu Jul 10 12:27:27 2003
+++ include/asm-generic/errno-base.h Sun Nov 16 07:21:32 2003
@@ -36,4 +36,8 @@
#define EDOM 33 /* Math argument out of domain of func */
#define ERANGE 34 /* Math result not representable */

+ /* FIXME: ugly hack to avoid conflicts -- need to get better numbers */
+#define EOWNERDEAD 525 /* Mutex owner died */
+#define ENOTRECOVERABLE 526 /* Mutex state is not recoverable */
+
#endif

2004-04-21 04:03:50

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 3/11: Support for i386

arch/i386/kernel/entry.S | 5 +++
include/asm-i386/fulock.h | 70 ++++++++++++++++++++++++++++++++++++++++++++++
include/asm-i386/unistd.h | 7 +++-
3 files changed, 81 insertions(+), 1 deletion(-)

--- arch/i386/kernel/entry.S:1.1.1.13 Tue Apr 6 01:51:19 2004
+++ arch/i386/kernel/entry.S Tue Apr 6 02:15:23 2004
@@ -882,5 +882,10 @@
.long sys_utimes
.long sys_fadvise64_64
.long sys_ni_syscall /* sys_vserver */
+ .long sys_ufulock_lock
+ .long sys_ufulock_unlock /* 275 */
+ .long sys_ufulock_ctl
+ .long sys_ufuqueue_wait
+ .long sys_ufuqueue_wake /* 278 */

syscall_table_size=(.-sys_call_table)
--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/asm-i386/fulock.h Mon Feb 2 23:07:28 2004
@@ -0,0 +1,70 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_i386_fulock_h__
+#define __asm_i386_fulock_h__
+
+ /* fulock value / state; anything that is not this is a PID that
+ * currently owns the fulock. */
+
+enum vfulock {
+ VFULOCK_UNLOCKED = 0x00000000, /* Unlocked */
+ VFULOCK_HEALTHY = VFULOCK_UNLOCKED, /* KCO mode: the lock is healthy */
+ VFULOCK_WP = 0xfffffffd, /* Waiters blocked in the kernel */
+ VFULOCK_DEAD = 0xfffffffe, /* dead, kernel controls ownership */
+ VFULOCK_NR = 0xffffffff /* fulock is not-recoverable */
+};
+
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+ unsigned old_value, unsigned new_value)
+{
+ unsigned result;
+ asm __volatile__ (
+ "lock cmpxchg %3, %1"
+ : "=a" (result), "+m" ((*value))
+ : "a" (old_value), "r" (new_value)
+ : "memory");
+ return result == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock: Pointer to the address of the ufulock to contain for.
+ * @value: New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+ *vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_i386_fulock_h__ */
--- include/asm-i386/unistd.h:1.1.1.8 Tue Apr 6 01:51:33 2004
+++ include/asm-i386/unistd.h Tue Apr 6 02:16:24 2004
@@ -279,8 +279,13 @@
#define __NR_utimes 271
#define __NR_fadvise64_64 272
#define __NR_vserver 273
+#define __NR_ufulock_lock 274
+#define __NR_ufulock_unlock 275
+#define __NR_ufulock_ctl 276
+#define __NR_ufuqueue_wait 277
+#define __NR_ufuqueue_wake 278

-#define NR_syscalls 274
+#define NR_syscalls 279

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

2004-04-21 04:03:49

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 6/11: Support for ppc64

arch/ppc64/kernel/misc.S | 5 +++
include/asm-ppc64/fulock.h | 68 +++++++++++++++++++++++++++++++++++++++++++++
include/asm-ppc64/unistd.h | 7 +++-
3 files changed, 79 insertions(+), 1 deletion(-)

--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/asm-ppc64/fulock.h Wed Apr 14 21:01:45 2004
@@ -0,0 +1,68 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_ppc64_fulock_h__
+#define __asm_ppc64_fulock_h__
+
+/*
+ * fulock value / state; anything that is not this is a PID that
+ * currently owns the fulock.
+ */
+
+enum vfulock {
+ VFULOCK_UNLOCKED = 0x00000000, /* Unlocked */
+ VFULOCK_HEALTHY = VFULOCK_UNLOCKED, /* KCO mode: the lock is healthy */
+ VFULOCK_KCO = 0xfffffffd, /* KCO: kernel controls ownership */
+ VFULOCK_DEAD = 0xfffffffe, /* KCO: dead, kernel controls ownership */
+ VFULOCK_NR = 0xffffffff /* KCO: fulock is not-recoverable */
+};
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+ unsigned old_value, unsigned new_value)
+{
+ unsigned result;
+
+ result = cmpxchg (value, old_value, new_value);
+ return result == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock: Pointer to the ufulock's associated value (the vfulock).
+ * @value: New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+ *vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_ppc64_fulock_h__ */
--- include/asm-ppc64/unistd.h:1.1.1.4 Tue Apr 6 00:22:57 2004
+++ include/asm-ppc64/unistd.h Wed Apr 14 21:01:45 2004
@@ -266,8 +266,13 @@
#define __NR_fstatfs64 253
#define __NR_fadvise64_64 254
#define __NR_rtas 255
+#define __NR_ufulock_lock 256
+#define __NR_ufulock_unlock 257
+#define __NR_ufulock_ctl 258
+#define __NR_ufuqueue_wait 259
+#define __NR_ufuqueue_wake 260

-#define __NR_syscalls 256
+#define __NR_syscalls 261
#ifdef __KERNEL__
#define NR_syscalls __NR_syscalls
#endif
--- arch/ppc64/kernel/misc.S:1.1.1.5 Tue Apr 6 00:22:01 2004
+++ arch/ppc64/kernel/misc.S Wed Apr 14 21:01:45 2004
@@ -1087,3 +1087,8 @@
.llong .sys_fstatfs64
.llong .sys_ni_syscall /* 32bit only fadvise64_64 */
.llong .ppc_rtas /* 255 */
+ .llong .sys_ufulock_lock
+ .llong .sys_ufulock_unlock
+ .llong .sys_ufulock_ctl
+ .llong .sys_ufuqueue_wait
+ .llong .sys_ufuqueue_wake

2004-04-21 04:08:00

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 7/11: kernel fuqueues

include/linux/fuqueue.h | 468 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/plist.h | 202 ++++++++++++++++++++
kernel/fuqueue.c | 209 +++++++++++++++++++++
3 files changed, 879 insertions(+)

--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/linux/fuqueue.h Tue Apr 6 02:29:42 2004
@@ -0,0 +1,468 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Quick usage guide:
+ *
+ * struct fuqueue f = fuqueue_INIT (&f);
+ *
+ * fuqueue_wait (&f, 4343) Wait for max 4343 jiffies
+ * fuqueue_wake (&f, 5, -EPERM) Wake the first five guys waiting on f
+ * with -EPERM.
+ *
+ * These are simple priority-sorted wait queues that can be used from
+ * the kernel. They provide the foundation for more complex items
+ * (fulocks, fuconds, ufulocks, ufuconds) and use from user-space
+ * (ufuqueues, just like futexes).
+ *
+ * The type wlist_t provides the sorting for the wait list. The type
+ * 'struct fuqueue_waiter' represents a waiting task on a fuqueue. The
+ * 'struct fuqueue_ops' is what allows extensibility for other
+ * synchronization primitives.
+ *
+ * I have just realized that this is too similar to the wait_queues...
+ */
+
+#ifndef __linux_fuqueue_h__
+#define __linux_fuqueue_h__
+
+#ifdef __KERNEL__
+
+#include <linux/errno.h>
+#include <linux/spinlock.h>
+#include <linux/plist.h>
+#include <linux/sched.h>
+#include <asm/hardirq.h>
+
+ /*
+ * Debug stuff -- move forward for seeing the real meat
+ *
+ * Levels
+ *
+ * 0 Nothing activated, all compiled out
+ * > 0 Activates assertions, memory allocation tracking
+ * > 1 Activates debug messages
+ * > 2 Activates [most] function traces
+ * > 3 Activates random debug stuff
+ */
+
+#undef DEBUG
+#define DEBUG 0
+
+#if DEBUG > 0
+#if 0 || !defined(__i386__)
+#define __debug_printstr(a...) printk(a) /* Dump to normal console */
+#else /* Dump straight to ttyS0 */
+#include <linux/serial_reg.h>
+#include <linux/stringify.h>
+static inline void __debug_outb (unsigned val, int port) {
+ __asm__ __volatile__ ("outb %b0,%w1" : : "a" (val), "Nd" (port));
+}
+static inline unsigned __debug_inb (int port) {
+ unsigned value;
+ __asm__ __volatile__ ("inb %w1,%b0" : "=a" (value) : "Nd" (port));
+ return value;
+}
+static inline
+void __debug_printstr (const char *str) {
+ const int port = 0x03f8;
+ while (*str) {
+ while (!(__debug_inb (port + UART_LSR) & UART_LSR_THRE));
+ __debug_outb (*str++, port+UART_TX);
+ }
+ __debug_outb ('\r', port + UART_TX);
+}
+#endif
+
+extern spinlock_t __debug_lock;
+
+static inline
+u64 __tsc_read (void)
+{
+ u64 tsc;
+#if defined(__i386__)
+ __asm__ __volatile__("rdtsc" : "=A" (tsc));
+#elif defined (__ia64__)
+ __asm__ __volatile__("mov %0=ar.itc" : "=r" (tsc) : : "memory");
+#else
+#warning "Architecture not supported in __tsc_read()!"
+ tsc = 0;
+#endif
+ return tsc;
+}
+
+#define __debug(a...) \
+do { \
+ /* Dirty: Try to avoid >1 CPUs printing ... will suck */ \
+ char __X_buf[256]; \
+ unsigned __X_len; \
+ unsigned long __X_flags; \
+ __X_len = snprintf (__X_buf, 255, "%Lu: %s:%d: %s[%d:%d] ", \
+ __tsc_read(), __FILE__, __LINE__, __FUNCTION__, \
+ current->pid, current->thread_info->cpu); \
+ snprintf (__X_buf + __X_len, 255 - __X_len, a); \
+ spin_lock_irqsave (&__debug_lock, __X_flags); \
+ __debug_printstr (__X_buf); \
+ spin_unlock_irqrestore (&__debug_lock, __X_flags); \
+} while (0)
+#endif /* #if DEBUG > 0 */
+
+/* The real debug statements */
+
+#if DEBUG > 0
+#define ldebug(l,a...) do { if (DEBUG >= l) __debug (a); } while (0)
+#define debug(a...) ldebug(1,a)
+#define fdebug(f, a...) do { if ((DEBUG >= 2) && f) __debug (a); } while (0)
+#define __ftrace(l,a...) do { if ((l)) __debug (a); } while (0)
+#define ftrace(a...) __ftrace((DEBUG >= 2),a)
+#define assert(c, a...) do { if ((DEBUG >= 0) && !(c)) BUG(); } while (0)
+#else
+#define ldebug(l,a...)
+#define debug(a...)
+#define fdebug(f, a...)
+#define __ftrace(l,a...)
+#define ftrace(a...)
+#define assert(c, a...)
+#endif
+
+
+
+
+/**
+ * Wait list type, O(N) addition, O(1) removal
+ *
+ * This is just a redirection of the methods used to manage the list
+ * of waiters that are waiting for a fuqueue. For hardcore-timing
+ * applications, you might want to use a PALIST instead of a PLIST,
+ * but FIXME, it is not done yet :)
+ */
+
+typedef struct plist wlist_t;
+
+#define FULOCK_WLIST_USE_PLIST
+#ifdef FULOCK_WLIST_USE_PLIST
+#define wlist_add plist_add
+#define wlist_rem plist_rem
+#define __wlist_rem __plist_rem
+#define wlist_first plist_first
+#define wlist_empty plist_empty
+#define wlist_INIT plist_INIT
+#define wlist_init plist_init
+#define wlist_last plist_last
+#define wlist_update_prio plist_update_prio
+#define wlist_prio plist_prio
+#define wlist_chprio plist_chprio
+#define __wlist_chprio __plist_chprio
+#if 0
+#define wlist_for_each_safe plist_for_each_safe
+#endif
+#endif
+
+struct task_struct;
+struct fuqueue;
+
+/** Descriptor of a waiting task */
+struct fuqueue_waiter {
+ wlist_t wlist_node; /* node for the wait list */
+ struct task_struct *task; /* task that is waiting */
+ int result; /* what happened */
+};
+
+/**
+ * Operations on a fuqueue.
+ *
+ * FIXME: is it worth having get/put? Maybe they should be enforced
+ * for every fuqueue, this way we don't have to query the ops
+ * structure for the get/put method and if it is there, call
+ * it. We'd have to move the get/put ops over to the vlocator,
+ * but that's not much of a problem.
+ *
+ * The deciding factor is that an atomic operation needs to
+ * lock the whole bus and is not as scalable as testing a ptr
+ * for NULLness.
+ *
+ * For simplicity, probably the idea of forcing the refcount in
+ * the fuqueue makes sense.
+ */
+struct fuqueue_ops {
+ void (* get) (struct fuqueue *);
+ void (* put) (struct fuqueue *);
+ unsigned (* waiter_cancel) (struct fuqueue *, struct fuqueue_waiter *);
+ struct task_struct * (* waiter_chprio) (
+ struct task_struct *, struct fuqueue *,
+ struct fuqueue_waiter *);
+};
+
+/** A fuqueue, a prioritized wait queue usable from kernel space. */
+struct fuqueue {
+ spinlock_t lock;
+ wlist_t wlist;
+ struct fuqueue_ops *ops;
+};
+
+
+/** Initialize a @fuqueue structure with given @ops */
+static inline
+void __fuqueue_init (struct fuqueue *fuqueue, struct fuqueue_ops *ops)
+{
+ spin_lock_init (&fuqueue->lock);
+ wlist_init (&fuqueue->wlist);
+ fuqueue->ops = ops;
+}
+
+/** Statically initialize a @fuqueue with given @ops. */
+#define __fuqueue_INIT(fuqueue, ops) { \
+ .lock = SPIN_LOCK_UNLOCKED, \
+ .wlist = wlist_INIT (&(fuqueue)->wlist), \
+ .ops = (ops) \
+}
+
+
+/** fuqueue operations for in-kernel usage */
+extern struct fuqueue_ops fuqueue_ops;
+
+
+/** Initialize a @fuqueue for usage within the kernel */
+static inline
+void fuqueue_init (struct fuqueue *fuqueue)
+{
+ __fuqueue_init (fuqueue, &fuqueue_ops);
+}
+
+/** Statically initialize a @fuqueue for usage within the kernel */
+#define fuqueue_INIT(fuqueue) __fuqueue_INIT(fuqueue, &fuqueue_ops)
+
+
+/** Wait for a @fuqueue to be woken for as much as @timeout, @returns
+ * wake up code */
+extern int fuqueue_wait (struct fuqueue *fuqueue, signed long timeout);
+
+/* Cancel the wait on a fuqueue */
+extern void fuqueue_waiter_cancel (struct task_struct *, int);
+
+/*
+ * The following are functions to be able to access fuqueue
+ * functionality when building on top of it, like the [u]fulocks,
+ * [u]fuconds and ufuqueues.
+ */
+
+#if DEBUG > 0
+/* BUG_ON() firing? Temporary fix, do you have CONFIG_PREEMPT enabled?
+ * either that or disable DEBUG (force #define it to zero). */
+#define CHECK_IRQs() do { BUG_ON (!in_atomic()); } while (0)
+#else
+#define CHECK_IRQs() do {} while (0)
+#endif
+
+
+/**
+ * Setup @current to wait for a fuqueue.
+ *
+ * This only setups, it does not block.
+ *
+ * @fuqueue: fuqueue to wait on.
+ * @w: waiter structure to fill up and queue.
+ * @returns !0 if the wlist prio changed.
+ *
+ * Fills up @current's fuqueue_wait* info and queues up @w after
+ * filling it.
+ *
+ * WARNING: Call with preempt and local IRQs disabled
+ */
+static inline
+unsigned __fuqueue_waiter_queue (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ ftrace ("(%p, %p)\n", fuqueue, w);
+ CHECK_IRQs();
+
+ _raw_spin_lock (&current->fuqueue_wait_lock);
+ current->fuqueue_wait = fuqueue;
+ current->fuqueue_waiter = w;
+ _raw_spin_unlock (&current->fuqueue_wait_lock);
+ w->task = current;
+ w->result = INT_MAX;
+ __wlist_chprio (&w->wlist_node, current->prio);
+ return wlist_add (&fuqueue->wlist, &w->wlist_node);
+}
+
+
+/**
+ * Wakes up a single waiter and cleans up its task wait information.
+ *
+ * Needs to be split from __fuqueue_wake_waiter() as
+ * fuqueue_wake_cancel() needs to acquire the locks in a funny way to
+ * prevent issues.
+ *
+ * WARNING: call with preempt and local IRQs disabled!!
+ */
+static inline
+void __fuqueue_waiter_unqueue (struct fuqueue_waiter *w, int result)
+{
+ struct task_struct *task = w->task;
+
+ ftrace ("(%p [%d], %d)\n", w, w->task->pid, result);
+ CHECK_IRQs();
+
+ __wlist_rem (&w->wlist_node);
+ _raw_spin_lock (&task->fuqueue_wait_lock);
+ task->fuqueue_wait = NULL;
+ task->fuqueue_waiter->result = result;
+ task->fuqueue_waiter = NULL;
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+}
+
+
+/**
+ * Wait for a @fuqueue until woken up, @returns wake up code
+ *
+ * Needs to be called with the fuqueue lock held, local IRQs disabled
+ * and preempt disabled. Will release the lock, enable IRQs and
+ * preemption.
+ */
+static inline
+int __fuqueue_waiter_block (struct fuqueue *fuqueue, struct fuqueue_waiter *w,
+ signed long timeout)
+{
+ ftrace ("(%p, %p, %ld)\n", fuqueue, w, timeout);
+ CHECK_IRQs();
+
+ set_current_state (TASK_INTERRUPTIBLE);
+ _raw_spin_unlock (&fuqueue->lock);
+ local_irq_enable();
+ preempt_enable_no_resched();
+
+ /* Wait until we are woken up */
+ timeout = schedule_timeout (timeout);
+ set_current_state (TASK_RUNNING);
+ /*
+ * Now, whoever woke us up had to call first
+ * fuqueue->ops->waiter_cancel() through fuqueue_waiter_cancel(),
+ * and thus, unqueued us and set up a return code in
+ * w->result. However, if w->result is still pristine as left
+ * by __fuqueue_waiter_queue(), that means our waker either
+ * didn't know we were waiting or we have signal(s) pending,
+ * so we do it ourselves.
+ */
+ if (unlikely (w->result == INT_MAX)) {
+ int result = -EINTR;
+ BUG_ON (wlist_empty (&w->wlist_node));
+ if (timeout == 0)
+ result = -ETIMEDOUT;
+ else
+ WARN_ON (!signal_pending (current));
+ fuqueue_waiter_cancel (current, result);
+ }
+ return w->result;
+}
+
+
+
+/** @Returns true if the @fuqueue has no waiters. */
+static inline
+unsigned __fuqueue_empty (const struct fuqueue *fuqueue)
+{
+ return wlist_empty (&fuqueue->wlist);
+}
+
+
+/** Return the first waiter on a @fuqueue */
+static inline
+struct fuqueue_waiter * __fuqueue_first (struct fuqueue *fuqueue)
+{
+ return container_of (wlist_first (&fuqueue->wlist),
+ struct fuqueue_waiter, wlist_node);
+}
+
+/** Wake @howmany @fuqueue waiters with @code. */
+static inline
+void __fuqueue_wake (struct fuqueue *fuqueue, size_t howmany, int code)
+{
+ struct fuqueue_waiter *w = NULL;
+
+ ftrace ("(%p, %zu, %d)\n", fuqueue, howmany, code);
+
+ while (howmany-- && !__fuqueue_empty (fuqueue)) {
+ w = __fuqueue_first (fuqueue);
+ __fuqueue_waiter_unqueue (w, code);
+ wake_up_process (w->task);
+ }
+ if (likely (w != NULL))
+ wlist_update_prio (&fuqueue->wlist);
+}
+
+
+/** Wake @howmany waiters waiting on a @fuqueue, with return code (for
+ * them) @code */
+static inline
+int fuqueue_wake (struct fuqueue *fuqueue, size_t howmany, int code)
+{
+ unsigned long flags;
+
+ ftrace ("(%p, %zu, %d)\n", fuqueue, howmany, code);
+
+ spin_lock_irqsave (&fuqueue->lock, flags);
+ __fuqueue_wake (fuqueue, howmany, code);
+ spin_unlock_irqrestore (&fuqueue->lock, flags);
+ return 0;
+}
+
+
+/* See docs in kernel/fuqueue.c */
+extern unsigned __fuqueue_op_waiter_cancel (struct fuqueue *,
+ struct fuqueue_waiter *);
+
+
+/** [Version that does not disable IRQs/preemption itself; the caller must] */
+extern void __fuqueue_waiter_chprio (struct task_struct *);
+
+/* A waiting @task changed its priority, propagate it. */
+static inline
+void fuqueue_waiter_chprio (struct task_struct *task)
+{
+ unsigned long flags;
+
+ ftrace ("(%p [%d])\n", task, task->pid);
+
+ if (task->fuqueue_wait == NULL)
+ return;
+ local_irq_save (flags);
+ preempt_disable();
+ __fuqueue_waiter_chprio (task);
+ local_irq_restore (flags);
+ preempt_enable();
+}
+
+
+/**
+ * Set the priority of a fuqueue waiter, repositioning it in the wait
+ * list.
+ *
+ * This does not set the prio of the process itself!
+ *
+ * @task: waiting task to reprioritize
+ * @fuqueue: fuqueue the task is waiting for [locked]
+ * @w: waiter @task is waiting with in @fuqueue.
+ * @returns: NULL (as there is no propagation).
+ *
+ * Assumptions: prio != task->prio
+ * fuqueue->lock held
+ */
+static inline
+struct task_struct * __fuqueue_op_waiter_chprio (
+ struct task_struct *task, struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ wlist_chprio (&fuqueue->wlist, &w->wlist_node, task->prio);
+ return NULL;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_fuqueue_h__ */
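
To make the quick usage guide at the top of the header concrete, here is a
minimal in-kernel sketch; the queue name, the timeout and the wake code are
made up, and error handling is omitted.

static struct fuqueue example_queue = fuqueue_INIT (&example_queue);

/* Waiter side: block for at most 100 jiffies; the return value is the
 * wake code, or -ETIMEDOUT / -EINTR if nobody woke us up. */
static int example_wait (void)
{
        return fuqueue_wait (&example_queue, 100);
}

/* Waker side: wake up to five waiters, handing them return code 0. */
static void example_wake (void)
{
        fuqueue_wake (&example_queue, 5, 0);
}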
--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/linux/plist.h Thu Apr 8 00:38:53 2004
@@ -0,0 +1,202 @@
+/*
+ * Descending-priority-sorted double-linked list
+ *
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on simple lists (include/linux/list.h).
+ *
+ * The head will always contain the head and the highest priority of
+ * the nodes in the list. Addition is O(N), removal is O(1), change of
+ * priority is O(N)
+ *
+ * 0 is highest priority, INT_MAX is lowest priority.
+ *
+ * No locking is done, up to the caller.
+ *
+ * TODO: O(1)ization, so all operations become O(K), where K is a
+ * constant being the number of priorities used. For this, there
+ * is a list of nodes of different priorities, and each has a
+ * list of all elements of the same priority.
+ */
+
+#ifndef _LINUX_PLIST_H_
+#define _LINUX_PLIST_H_
+
+#include <linux/list.h>
+
+/* Priority-sorted list */
+struct plist {
+ int prio;
+ struct list_head node;
+};
+
+#define plist_INIT(p) \
+{ \
+ .node = LIST_HEAD_INIT((p)->node), \
+ .prio = INT_MAX \
+}
+
+/* Initialize a pl */
+static inline
+void plist_init (struct plist *pl)
+{
+ INIT_LIST_HEAD (&pl->node);
+ pl->prio = INT_MAX;
+}
+
+/* Return the first node (and thus, highest priority)
+ *
+ * Assumes the plist is _not_ empty.
+ */
+static inline
+struct plist * plist_first (struct plist *plist)
+{
+ return list_entry (plist->node.next, struct plist, node);
+}
+
+/* Return if the plist is empty. */
+static inline
+unsigned plist_empty (const struct plist *plist)
+{
+ return list_empty (&plist->node);
+}
+
+/* Update the maximum priority
+ *
+ * Return !0 if the plist's maximum priority changes.
+ *
+ * __plist_update_prio() assumes the plist is not empty.
+ */
+static inline
+unsigned __plist_update_prio (struct plist *plist)
+{
+ unsigned prio = plist_first (plist)->prio;
+ if (plist->prio == prio)
+ return 0;
+ plist->prio = prio;
+ return !0;
+}
+
+static inline
+unsigned plist_update_prio (struct plist *plist)
+{
+ unsigned old_prio = plist->prio;
+ /* plist empty, lowest prio = INT_MAX */
+ plist->prio = plist_empty (plist)? INT_MAX : plist_first (plist)->prio;
+ return old_prio != plist->prio;
+}
+
+/* Add a node to the plist [internal]
+ *
+ * pl->prio == INT_MAX is a special case: it means lowest priority, so
+ * it goes straight to the end of the plist. Note the <; we want to add
+ * a new node after any existing nodes of the same priority (FIFO
+ * behaviour within the same priority).
+ */
+static inline
+void __plist_add_sorted (struct plist *plist, struct plist *pl)
+{
+ struct list_head *itr;
+ struct plist *itr_pl;
+
+ if (pl->prio < INT_MAX)
+ list_for_each (itr, &plist->node) {
+ itr_pl = list_entry (itr, struct plist, node);
+ if (pl->prio < itr_pl->prio)
+ break;
+ }
+ else
+ itr = &plist->node;
+ list_add_tail (&pl->node, itr);
+}
+
+/* Add a node to a plist
+ *
+ * Return !0 if the plist's maximum priority changes.
+ */
+static inline
+unsigned plist_add (struct plist *plist, struct plist *pl)
+{
+ __plist_add_sorted (plist, pl);
+ /* Are we setting a higher priority? */
+ if (pl->prio < plist->prio) {
+ plist->prio = pl->prio;
+ return !0;
+ }
+ return 0;
+}
+
+/* Return the priority of a pl node */
+static inline
+unsigned plist_prio (struct plist *pl)
+{
+ return pl->prio;
+}
+
+/* Change the priority of a pl node, without updating plist position */
+static inline
+void __plist_chprio (struct plist *pl, unsigned new_prio)
+{
+ pl->prio = new_prio;
+}
+
+/* Change the priority of a pl node updating the list's max priority.
+ *
+ * Return !0 if the plist's maximum priority changes
+ */
+static inline
+unsigned plist_chprio (struct plist *plist, struct plist *pl,
+ unsigned new_prio)
+{
+ __plist_chprio (pl, new_prio);
+ list_del (&pl->node);
+ __plist_add_sorted (plist, pl);
+ return __plist_update_prio (plist);
+}
+
+/* Remove a pl node from a plist
+ *
+ * Return !0 if the plist's maximum priority changed.
+ */
+static inline
+void __plist_rem (struct plist *pl)
+{
+ list_del_init (&pl->node);
+}
+static inline
+unsigned plist_rem (struct plist *plist, struct plist *pl)
+{
+ __plist_rem (pl);
+ return plist_update_prio (plist);
+}
+
+/* Iterate over a plist - in priority order (from high to low) */
+#define plist_for_each(pos, head) \
+for (pos = container_of ((head)->node.next, struct plist, node), \
+ prefetch (pos->node.next); \
+ pos != (head); \
+ pos = container_of (pos->node.next, struct plist, node), \
+ prefetch (pos->node.next))
+
+#define plist_for_each_safe(pos, n, head) \
+ for (pos = container_of ((head)->node.next, struct plist, node), \
+ n = container_of (pos->node.next, struct plist, node); \
+ pos != (head); \
+ pos = n, \
+ n = container_of (pos->node.next, struct plist, node))
+
+
+/** Return !0 if node @pl is the last one on list @plist. */
+
+static inline
+unsigned plist_last (const struct plist *plist, const struct plist *pl)
+{
+ return pl->node.next == &plist->node;
+}
+
+
+#endif /* #ifndef _LINUX_PLIST_H_ */
+
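
A short usage sketch of the plist API above; the node names and priorities
are invented for illustration.

static void plist_example (void)
{
        struct plist head, a, b;

        plist_init (&head);             /* empty head, prio INT_MAX */
        plist_init (&a);
        plist_init (&b);

        __plist_chprio (&a, 10);        /* remember: 0 is highest priority */
        __plist_chprio (&b, 5);

        plist_add (&head, &a);          /* O(N) sorted insertion */
        plist_add (&head, &b);          /* head's max priority is now 5 */

        /* plist_first (&head) == &b, the highest priority node */
        plist_rem (&head, &b);          /* O(1) removal, head prio recomputed */
}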
--- /dev/null Thu Apr 15 00:58:26 2004
+++ kernel/fuqueue.c Wed Apr 14 01:48:37 2004
@@ -0,0 +1,209 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * see the doc in linux/fuqueue.h for some info.
+ */
+
+#define DEBUG 8
+
+#include <linux/fuqueue.h>
+#include <linux/sched.h>
+#include <linux/plist.h>
+#include <linux/errno.h>
+
+#if DEBUG > 0
+spinlock_t __debug_lock = SPIN_LOCK_UNLOCKED;
+#endif
+
+
+/**
+ * Wait for a @fuqueue to be woken for as much as @timeout, @returns
+ * wake up code
+ *
+ * WARNING: can only be called from process context
+ */
+int fuqueue_wait (struct fuqueue *fuqueue, signed long timeout)
+{
+ struct fuqueue_waiter w;
+
+ ftrace ("(%p, %ld)\n", fuqueue, timeout);
+
+ spin_lock_irq (&fuqueue->lock);
+ __fuqueue_waiter_queue (fuqueue, &w);
+ return __fuqueue_waiter_block (fuqueue, &w, timeout); /* unlocks */
+}
+
+
+/**
+ * Change the priority of a waiter.
+ *
+ * @task: task whose priority has to be changed.
+ *
+ * This is the entry point that the scheduler functions call when a
+ * task that is waiting for a fuqueue changes its priority. It will
+ * safely determine which fuqueue the task is waiting for, then call
+ * that fuqueue's specific chprio function and, if that function
+ * indicates the priority change has to be propagated to another
+ * task, keep propagating it along the chain.
+ *
+ * The task that the chprio function returns has to be returned with a
+ * get_task_struct() reference.
+ *
+ * Note the weird locking: we are one of the few places that needs to
+ * take the locks in inverse order (most take the fuqueue lock first
+ * and the task's wait lock later--FT); we need TF, so we take T,
+ * trylock F and, if that fails, unlock T and try again.
+ */
+void __fuqueue_waiter_chprio (struct task_struct *task)
+{
+ struct fuqueue_ops *ops;
+ struct fuqueue *fuqueue;
+ struct fuqueue_waiter *w;
+
+ ftrace ("(%p [%d])\n", task, task->pid);
+
+ get_task_struct (task);
+next:
+ /* Who is the task waiting for? safely acquire and lock it */
+ if (task->fuqueue_wait == NULL)
+ goto out_task_put;
+ _raw_spin_lock (&task->fuqueue_wait_lock);
+ fuqueue = task->fuqueue_wait;
+ if (fuqueue == NULL) /* Ok, not waiting */
+ goto out_task_unlock;
+ if (!_raw_spin_trylock (&fuqueue->lock)) { /* Spin dance... */
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ goto next;
+ }
+ ops = fuqueue->ops;
+ if (ops->get)
+ ops->get (fuqueue);
+
+ w = task->fuqueue_waiter;
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+ /* propagate the prio change in a fuqueue-specific fashion */
+ task = ops->waiter_chprio (task, fuqueue, w);
+ if (task == NULL) /* no other task to propagate to? */
+ goto out_fuqueue_unlock;
+ /* We were given a task to propagate to, proceed? */
+ get_task_struct (task);
+ _raw_spin_unlock (&fuqueue->lock);
+ if (ops->put)
+ ops->put (fuqueue);
+ goto next;
+
+out_fuqueue_unlock:
+ _raw_spin_unlock (&fuqueue->lock);
+ if (ops->put)
+ ops->put (fuqueue);
+ goto out;
+
+out_task_unlock:
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+out_task_put:
+ put_task_struct (task);
+out:
+ return;
+}
+
+/** Cancel @task's wait on @fuqueue and update the wait list priority */
+unsigned __fuqueue_op_waiter_cancel (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ ftrace ("(%p, %p [%d])\n", fuqueue, w, w->task->pid);
+
+ return wlist_rem (&fuqueue->wlist, &w->wlist_node);
+}
+
+
+/**
+ * Cancel the wait of a task on a fuqueue and wake it up.
+ *
+ * @task: task whose wait is to be canceled
+ *
+ * Called by:
+ * - signal_wake_up()
+ * - process_timeout()
+ * - __wake_up_common()
+ * - FIXME
+ *
+ * when the task they are about to wake is waiting on a
+ * fuqueue. Safely acquires which fuqueue the task is waiting for,
+ * references it, cleans up the task->fuqueue_wait* information, and
+ * then calls the fuqueue specific waiter_cancel() function.
+ *
+ * FIXME: the entry points we get called from don't seem to be all of
+ * them; the perfect thing here would be to hook into
+ * try_to_wake_up()--but it is kind of tricky...
+ *
+ * Note that for this function to actually do anything, the task must
+ * be a task waiting on a fuqueue (so that means TASK_INTERRUPTIBLE
+ * and off the runqueues).
+ */
+void fuqueue_waiter_cancel (struct task_struct *task, int result)
+{
+ struct fuqueue_ops *ops;
+ struct fuqueue *fuqueue;
+ struct fuqueue_waiter *w;
+ unsigned long flags;
+
+ ftrace ("(%p [%d], %d)\n", task, task->pid, result);
+
+ local_irq_save (flags);
+ preempt_disable();
+ get_task_struct (task);
+retry:
+ /* Who is the task waiting for? safely acquire and lock it */
+ _raw_spin_lock (&task->fuqueue_wait_lock);
+ fuqueue = task->fuqueue_wait;
+ if (fuqueue == NULL) /* Ok, not waiting */
+ goto out_task_unlock;
+ if (!_raw_spin_trylock (&fuqueue->lock)) { /* Spin dance... */
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ goto retry;
+ }
+ ops = fuqueue->ops;
+ if (ops->get)
+ ops->get (fuqueue);
+
+ w = task->fuqueue_waiter;
+ w->result = result;
+
+ /* Do the specific cancel op */
+ ops->waiter_cancel (fuqueue, w);
+ task->fuqueue_wait = NULL;
+ task->fuqueue_waiter = NULL;
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+ _raw_spin_unlock (&fuqueue->lock);
+ local_irq_restore (flags);
+ preempt_enable();
+ if (ops->put)
+ ops->put (fuqueue);
+ return;
+
+out_task_unlock:
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+ local_irq_restore (flags);
+ preempt_enable();
+ return;
+}
+
+
+/** Fuqueue operations for usage within the kernel */
+struct fuqueue_ops fuqueue_ops = {
+ .get = NULL,
+ .put = NULL,
+ .waiter_cancel = __fuqueue_op_waiter_cancel,
+ .waiter_chprio = __fuqueue_op_waiter_chprio
+};

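The inverse lock ordering described in __fuqueue_waiter_chprio() and
fuqueue_waiter_cancel() reduces to a standard trylock-and-retry pattern; a
generic sketch follows, where the lock parameters are placeholders standing
in for task->fuqueue_wait_lock and fuqueue->lock.

static void lock_dance_sketch (spinlock_t *t_lock, spinlock_t *f_lock)
{
retry:
        _raw_spin_lock (t_lock);                /* T first (inverse order) */
        if (!_raw_spin_trylock (f_lock)) {
                /* F is busy and its holder may be about to take T in the
                 * usual F-then-T order; back off and retry to avoid an
                 * AB-BA deadlock. */
                _raw_spin_unlock (t_lock);
                goto retry;
        }
        /* ... both locks held, do the chprio/cancel work ... */
        _raw_spin_unlock (f_lock);
        _raw_spin_unlock (t_lock);
}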
2004-04-21 04:03:49

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 5/11: Support for ppc

arch/ppc/kernel/misc.S | 11 +++++--
include/asm-ppc/fulock.h | 68 +++++++++++++++++++++++++++++++++++++++++++++++
include/asm-ppc/unistd.h | 7 ++++
3 files changed, 82 insertions(+), 4 deletions(-)

--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/asm-ppc/fulock.h Tue Apr 6 02:30:02 2004
@@ -0,0 +1,68 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_ppc_fulock_h__
+#define __asm_ppc_fulock_h__
+
+/*
+ * fulock value / state; any other value is the PID of the task that
+ * currently owns the fulock.
+ */
+
+enum vfulock {
+ VFULOCK_UNLOCKED = 0x00000000, /* Unlocked */
+ VFULOCK_HEALTHY = VFULOCK_UNLOCKED, /* KCO mode: the lock is healthy */
+ VFULOCK_WP = 0xfffffffd, /* kernel controls ownership */
+ VFULOCK_DEAD = 0xfffffffe, /* dead, kernel controls ownership */
+ VFULOCK_NR = 0xffffffff /* fulock is not-recoverable */
+};
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+ unsigned old_value, unsigned new_value)
+{
+ unsigned result;
+
+ result = cmpxchg (value, old_value, new_value);
+ return result == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock: Pointer to the ufulock's associated value (the vfulock).
+ * @value: New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+ *vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_ppc_fulock_h__ */
--- include/asm-ppc/unistd.h:1.1.1.4 Tue Apr 6 00:22:57 2004
+++ include/asm-ppc/unistd.h Tue Apr 6 02:16:30 2004
@@ -260,8 +260,13 @@
#define __NR_fstatfs64 253
#define __NR_fadvise64_64 254
#define __NR_rtas 255
+#define __NR_ufulock_lock 256
+#define __NR_ufulock_unlock 257
+#define __NR_ufulock_ctl 258
+#define __NR_ufuqueue_wait 259
+#define __NR_ufuqueue_wake 260

-#define __NR_syscalls 256
+#define __NR_syscalls 261

#define __NR(n) #n

--- arch/ppc/kernel/misc.S:1.1.1.5 Tue Apr 6 00:22:00 2004
+++ arch/ppc/kernel/misc.S Tue Apr 6 02:15:36 2004
@@ -292,7 +292,7 @@
* sense anyway.
* -- Cort
*/
- mfmsr r4
+ mfmsr r4
/* Copy all except the MSR_EE bit from r4 (current MSR value)
to r3. This is the sort of thing the rlwimi instruction is
designed for. -- paulus. */
@@ -976,7 +976,7 @@
* R5 has shift count
* result in R3/R4
*
- * ashrdi3: arithmetic right shift (sign propagation)
+ * ashrdi3: arithmetic right shift (sign propagation)
* lshrdi3: logical right shift
* ashldi3: left shift
*/
@@ -1314,7 +1314,7 @@
.long sys_fstat64
.long sys_pciconfig_read
.long sys_pciconfig_write
- .long sys_pciconfig_iobase /* 200 */
+ .long sys_pciconfig_iobase /* 200 */
.long sys_ni_syscall /* 201 - reserved - MacOnLinux - new */
.long sys_getdents64
.long sys_pivot_root
@@ -1370,3 +1370,8 @@
.long sys_fstatfs64
.long ppc_fadvise64_64
.long sys_ni_syscall /* 255 - rtas (used on ppc64) */
+ .long sys_ufulock_lock
+ .long sys_ufulock_unlock
+ .long sys_ufulock_ctl
+ .long sys_ufuqueue_wait
+ .long sys_ufuqueue_wake /* 260 */

2004-04-21 04:12:09

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 9/11: user space fuqueues

ufuqueue.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 236 insertions(+)

--- /dev/null Thu Apr 15 00:58:26 2004
+++ kernel/ufuqueue.c Mon Feb 2 11:30:47 2004
@@ -0,0 +1,236 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/highmem.h> /* kmap_atomic() */
+#include <linux/vlocator.h>
+#include <linux/fuqueue.h>
+
+extern signed long ufu_timespec2jiffies (const struct timespec __user *);
+static struct fuqueue_ops ufuqueue_ops;
+
+/** Slab for allocation of 'struct ufuqueue's. */
+static kmem_cache_t *ufuqueue_slab;
+
+/** A ufuqueue, tied to a user-space vm address. */
+struct ufuqueue {
+ struct fuqueue fuqueue;
+ struct vlocator vlocator;
+};
+
+
+/** Initialize an @ufuqueue (for slab object construction). */
+static inline
+void ufuqueue_ctor (void *obj, kmem_cache_t *slab, unsigned long flags)
+{
+ struct ufuqueue *ufuqueue = obj;
+ __fuqueue_init (&ufuqueue->fuqueue, &ufuqueue_ops);
+ vlocator_init (&ufuqueue->vlocator);
+}
+
+
+/** Allocate an ufuqueue maintaining debug-allocation counts. */
+static __inline__
+struct vlocator * ufuqueue_alloc (void)
+{
+ struct ufuqueue *ufuqueue;
+ ufuqueue = kmem_cache_alloc (ufuqueue_slab, GFP_KERNEL);
+ if (ufuqueue == NULL) /* don't hand vl_locate() a bogus pointer */
+ return NULL;
+ if (DEBUG > 0)
+ ldebug (1, "ufuqueue %p allocated, total: %d\n",
+ ufuqueue, allocated_inc());
+ return &ufuqueue->vlocator;
+}
+
+
+/**
+ * Create a ufuqueue that is going to be inserted into the hash
+ * [called from vlocator.c:vl_locate() through the vlocator ops]
+ */
+static
+void ufuqueue_create (struct vlocator *vl, struct page *page,
+ const union futex_key *key, unsigned flags)
+{
+ memcpy (&vl->key, key, sizeof (*key));
+ get_key_refs (&vl->key);
+}
+
+
+/** Release an ufuqueue's resources (drop the futex key references). */
+static __inline__
+void ufuqueue_release (struct vlocator *vl)
+{
+ drop_key_refs (&vl->key);
+}
+
+
+/** Free an ufuqueue maintaining debug-allocation counts. */
+static __inline__
+void ufuqueue_free (struct vlocator *vl)
+{
+ struct ufuqueue *ufuqueue =
+ container_of (vl, struct ufuqueue, vlocator);
+ kmem_cache_free (ufuqueue_slab, ufuqueue);
+ if (DEBUG > 0 && ufuqueue != NULL)
+ ldebug (1, "ufuqueue %p freed, total: %d\n",
+ ufuqueue, allocated_dec());
+}
+
+
+struct vlocator_ops ufuqueue_vops = {
+ .alloc = ufuqueue_alloc,
+ .create = ufuqueue_create,
+ .release = ufuqueue_release,
+ .free = ufuqueue_free
+};
+
+
+/**
+ * Wait on a ufuqueue
+ *
+ * @_vfuqueue: address in user space of the ufuqueue
+ * @val: value that *_vfuqueue must still hold for the wait to proceed
+ * @_timeout: timeout information [see sys_ufulock_lock()]
+ *
+ * This is just a thin shell that locates the kernel descriptor for
+ * the ufuqueue, checks the value has not changed and then queues and
+ * blocks the caller in __fuqueue_waiter_block().
+ */
+asmlinkage
+int sys_ufuqueue_wait (volatile unsigned __user *_vfuqueue,
+ unsigned val, struct timespec __user *_timeout)
+{
+ int result = -EINVAL;
+ struct ufuqueue *ufuqueue;
+ struct page *page;
+ char *page_kmap;
+ volatile unsigned *vfuqueue;
+ unsigned new_val;
+ struct vlocator *vl;
+ signed long timeout;
+ struct fuqueue_waiter w;
+
+ ftrace ("(%p, %x, %p)\n", _vfuqueue, val, _timeout);
+ might_sleep();
+
+ /* ufuqueue: pin pages, get keys, look up/allocate, refcount */
+ result = vl_locate (&page, &vl, &ufuqueue_vops, _vfuqueue, 0);
+ if (unlikely (result < 0))
+ goto out;
+ ufuqueue = container_of (vl, struct ufuqueue, vlocator);
+ /* We are going to lock the ufuqueue, so get the timeout first */
+ timeout = ufu_timespec2jiffies (_timeout);
+ result = (int) timeout;
+ if (timeout < 0)
+ goto out_put_vl;
+
+ spin_lock_irq (&ufuqueue->fuqueue.lock);
+ __fuqueue_waiter_queue (&ufuqueue->fuqueue, &w);
+ /* Now, are we ok with the value? */
+ page_kmap = kmap_atomic (page, KM_IRQ0);
+ vfuqueue = (volatile unsigned *)page_kmap + (vl->key.both.offset & ~1);
+ new_val = *vfuqueue;
+ kunmap_atomic (page_kmap, KM_IRQ0);
+ result = -EWOULDBLOCK;
+ if (val != new_val)
+ goto out_unqueue;
+ /* ok, go ahead and wait (it will unlock and restore irqs/preempt) */
+ return __fuqueue_waiter_block (&ufuqueue->fuqueue, &w, timeout);
+
+out_unqueue:
+ __fuqueue_waiter_unqueue (&w, 0);
+ spin_unlock_irq (&ufuqueue->fuqueue.lock);
+out_put_vl:
+ vl_put (vl);
+ put_page (page);
+out:
+ return result;
+}
+
+
+/**
+ * Wake up waiters of a fuqueue
+ *
+ * @_vfuqueue: pointer to the fuqueue
+ * @howmany: number of waiters to wake up
+ * @code: code to return to the waiters [default it to zero]
+ * @returns: 0 if ok, < 0 errno code on error.
+ */
+asmlinkage
+int sys_ufuqueue_wake (volatile unsigned __user *_vfuqueue,
+ size_t howmany, int code)
+{
+ int result;
+ struct vlocator *vl;
+ struct ufuqueue *ufuqueue;
+
+ ftrace ("(%p, %u)\n", _vfuqueue, howmany);
+ might_sleep();
+
+ /* ufuqueue: get keys, look up (don't allocate), refcount */
+ result = vl_locate (NULL, &vl, NULL, _vfuqueue, 0);
+ if (result < 0)
+ goto out;
+ ufuqueue = container_of (vl, struct ufuqueue, vlocator);
+ /* Wake them up */
+ spin_lock_irq (&ufuqueue->fuqueue.lock);
+ __fuqueue_wake (&ufuqueue->fuqueue, howmany, code);
+ spin_unlock_irq (&ufuqueue->fuqueue.lock);
+ vl_put (vl);
+ result = 0;
+out:
+ return result;
+}
+
+
+/** Initialize the ufuqueue subsystem. */
+static
+int __init subsys_ufuqueue_init (void)
+{
+ ufuqueue_slab = kmem_cache_create ("ufuqueue", sizeof (struct ufuqueue),
+ 0, 0, ufuqueue_ctor, NULL);
+ if (ufuqueue_slab == NULL)
+ panic ("ufuqueue_init(): "
+ "Unable to initialize ufuqueue slab allocator.\n");
+ return 0;
+}
+__initcall (subsys_ufuqueue_init);
+
+
+/* Adaptors for fuqueue operations */
+static
+void ufuqueue_put (struct fuqueue *fuqueue)
+{
+ struct ufuqueue *ufuqueue =
+ container_of (fuqueue, struct ufuqueue, fuqueue);
+ vl_put (&ufuqueue->vlocator);
+}
+
+static
+void ufuqueue_get (struct fuqueue *fuqueue)
+{
+ struct ufuqueue *ufuqueue =
+ container_of (fuqueue, struct ufuqueue, fuqueue);
+ vl_get (&ufuqueue->vlocator);
+}
+
+/** Ufuqueue operations */
+static
+struct fuqueue_ops ufuqueue_ops = {
+ .get = ufuqueue_get,
+ .put = ufuqueue_put,
+ .waiter_cancel = __fuqueue_op_waiter_cancel,
+ .waiter_chprio = __fuqueue_op_waiter_chprio
+};
+

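From user space the two calls behave much like FUTEX_WAIT/FUTEX_WAKE, only
with a priority-sorted wait list. A hedged sketch of raw wrappers follows:
the syscall numbers are the i386 ones from patch 3/11 (other architectures
differ) and the wrapper names are illustrative, not a proposed library API.

#include <stddef.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_ufuqueue_wait 277          /* i386 numbers from patch 3/11 */
#define __NR_ufuqueue_wake 278

/* Block while *addr still equals val, or until timeout/wake-up/signal. */
static int my_ufuqueue_wait (volatile unsigned *addr, unsigned val,
                             struct timespec *timeout)
{
        return syscall (__NR_ufuqueue_wait, addr, val, timeout);
}

/* Wake up to howmany waiters, handing them return code 'code'. */
static int my_ufuqueue_wake (volatile unsigned *addr, size_t howmany, int code)
{
        return syscall (__NR_ufuqueue_wake, addr, howmany, code);
}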
2004-04-21 04:08:56

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 8/11: user space/kernel space tracker

include/linux/vlocator.h | 128 ++++++++++++
kernel/vlocator.c | 465 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 593 insertions(+)

--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/linux/vlocator.h Mon Jan 5 10:23:13 2004
@@ -0,0 +1,128 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Stuff to map user space addresses to kernel space objects in a
+ * controlled way.
+ */
+
+#ifndef __linux_vlocator_h__
+#define __linux_vlocator_h__
+
+#ifdef __KERNEL__
+
+#include <linux/list.h>
+#include <linux/hash.h>
+#include <asm/atomic.h>
+#include <linux/futex.h> /* for the futex hash */
+
+
+/** Track a kernel object by a user-level vm address. */
+struct vlocator {
+ atomic_t refcount;
+ struct list_head hash_list;
+ union futex_key key;
+ const struct vlocator_ops *ops;
+};
+
+static __inline__
+void vlocator_init (struct vlocator *vlocator)
+{
+ atomic_set (&vlocator->refcount, 0);
+ INIT_LIST_HEAD (&vlocator->hash_list);
+}
+
+
+/** A single queue in the hash. */
+struct vlocator_queue
+{
+ spinlock_t lock;
+ struct list_head queue;
+ unsigned additions;
+};
+
+static __inline__
+void vlocator_queue_init (struct vlocator_queue *vlq)
+{
+ vlq->lock = SPIN_LOCK_UNLOCKED;
+ INIT_LIST_HEAD (&vlq->queue);
+ vlq->additions = 0;
+}
+
+struct page;
+union futex_key;
+
+/** vlocator operations on the items that are to be located. */
+struct vlocator_ops
+{
+ struct vlocator * (* alloc) (void);
+ void (* create) (struct vlocator *, struct page *,
+ const union futex_key *, unsigned flags);
+ void (* release) (struct vlocator *);
+ void (* free) (struct vlocator *);
+};
+
+
+/** Reference a vlocator @vlocator. */
+#define vl_get(vlocator) \
+do { atomic_inc (&(vlocator)->refcount); } while (0)
+
+
+/** Unreference a @vlocator; return true if it drops to zero. */
+static inline
+unsigned vl_put (struct vlocator *vlocator)
+{
+ return atomic_dec_and_test (&vlocator->refcount);
+}
+
+extern int vl_find (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr);
+extern int vl_locate (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr, unsigned flags);
+extern int vl_dispose (const struct vlocator_ops *,
+ volatile const unsigned __user *);
+
+/* Debug stuff */
+
+/*
+ * Helpers for debugging memory allocation [needs to be here so it can
+ * be shared by ufuqueue.c and ufulock.c].
+ */
+
+#if DEBUG >= 1
+static atomic_t allocated_count = ATOMIC_INIT(0);
+static inline
+int allocated_get (void)
+{
+ return atomic_read (&allocated_count);
+}
+static inline
+int allocated_inc (void)
+{
+ atomic_inc (&allocated_count);
+ return allocated_get();
+}
+static inline
+int allocated_dec (void)
+{
+ atomic_dec (&allocated_count);
+ return allocated_get();
+}
+#else
+static inline int allocated_get (void) { return 0; }
+static inline int allocated_inc (void) { return 0; }
+static inline int allocated_dec (void) { return 0; }
+#endif /* DEBUG >= 1 */
+
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_vlocator_h__ */
--- /dev/null Thu Apr 15 00:58:26 2004
+++ kernel/vlocator.c Tue Mar 23 23:03:32 2004
@@ -0,0 +1,465 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#include <linux/time.h>
+#include <linux/sched.h>
+#include <linux/futex.h>
+#include <asm/uaccess.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+#include <linux/fuqueue.h> /* For the debug statements */
+#include <linux/vlocator.h>
+
+#define VL_GC_PERIOD 20 /* do garbage collection every 20 seconds */
+#define VL_HASH_BITS 8
+#define VL_HASH_QUEUES (1 << VL_HASH_BITS)
+static struct vlocator_queue __vl_hash[VL_HASH_QUEUES];
+static struct list_head __vl_disposal = LIST_HEAD_INIT (__vl_disposal);
+static spinlock_t __vl_disposal_lock = SPIN_LOCK_UNLOCKED;
+
+/** Get the hash index for futex_key @k. */
+static __inline__
+struct vlocator_queue * vl_hash (const union futex_key *k)
+{
+ unsigned index = futex_hash_key (k) & (VL_HASH_QUEUES - 1);
+ return &__vl_hash[index];
+}
+
+
+/** Search in a queue. */
+static __inline__
+struct vlocator * __vlocator_find (struct vlocator_queue *vlq,
+ const union futex_key *key)
+{
+ struct vlocator *vl = NULL;
+ struct list_head *itr;
+
+ list_for_each (itr, &vlq->queue) {
+ vl = container_of (itr, struct vlocator, hash_list);
+ if (match_futex_key (key, &vl->key))
+ goto out;
+ }
+ vl = NULL;
+out:
+ return vl;
+}
+
+
+/** Search in a queue [backwards - locates recent additions faster]. */
+static inline
+struct vlocator * __vlocator_find_r (struct vlocator_queue *vlq,
+ const union futex_key *key)
+{
+ struct vlocator *vl = NULL;
+ struct list_head *itr;
+
+ list_for_each_prev (itr, &vlq->queue) {
+ vl = container_of (itr, struct vlocator, hash_list);
+ if (match_futex_key (key, &vl->key))
+ goto out;
+ }
+ vl = NULL;
+out:
+ return vl;
+}
+
+
+/** Remove @v from its queue if its use count is zero (and
+ * return true). */
+static __inline__
+unsigned __vlocator_rem (struct vlocator *v)
+{
+ unsigned result = 0, refcount;
+ refcount = atomic_read (&v->refcount);
+ if (refcount == 0) {
+ list_del (&v->hash_list);
+ result = 1;
+ }
+ return result;
+}
+
+
+/** Append @vlocator to queue @vlq. */
+static __inline__
+void __vlocator_add (struct vlocator_queue *vlq, struct vlocator *v)
+{
+ vlq->additions++;
+ list_add (&v->hash_list, &vlq->queue);
+}
+
+
+/**
+ * Set up all the memory mumbo jumbo we need to access a user space
+ * item: locate the page where the address is and generate a futex_key
+ * for it. Return both referenced [with get_page() and get_key_refs()]
+ *
+ * Shamelessly copied from Jamie's kernel/futex.c:get_futex_key()...
+ * good invention this GPL thingie :)
+ */
+int vl_setup (struct page **ppage, union futex_key *key,
+ volatile const unsigned __user *_uaddr)
+{
+ int result;
+ struct page *page;
+ unsigned long uaddr = (unsigned long) _uaddr;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ ftrace ("(%p, %p, %p)\n", ppage, key, _uaddr);
+
+ /* The futex address must be "naturally" aligned. */
+ key->both.offset = uaddr % PAGE_SIZE;
+ if (unlikely ((key->both.offset % sizeof (u32)) != 0))
+ return -EINVAL;
+ uaddr -= key->both.offset;
+
+ /* Generate a key for the vfulock. */
+ down_read (&mm->mmap_sem);
+ /*
+ * The futex is hashed differently depending on whether
+ * it's in a shared or private mapping. So check vma first.
+ */
+ vma = find_extend_vma (mm, uaddr);
+ result = -EFAULT;
+ if (unlikely (!vma))
+ goto error_up;
+ /* Permissions. */
+ result = (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
+ if (unlikely ((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
+ goto error_up;
+ /*
+ * Private mappings are handled in a simple way.
+ *
+ * NOTE: When userspace waits on a MAP_SHARED mapping, even if
+ * it's a read-only handle, it's expected that futexes attach to
+ * the object not the particular process. Therefore we use
+ * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
+ * mappings of _writable_ handles.
+ */
+ result = 0;
+ if (likely (!(vma->vm_flags & VM_MAYSHARE))) {
+ key->private.mm = mm;
+ key->private.uaddr = uaddr;
+ goto out_pin;
+ }
+ /* Linear file mappings are also simple. */
+ key->shared.inode = vma->vm_file->f_dentry->d_inode;
+ key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ if (likely (!(vma->vm_flags & VM_NONLINEAR))) {
+ key->shared.pgoff = (((uaddr - vma->vm_start) >> PAGE_SHIFT)
+ + vma->vm_pgoff);
+ goto out_pin;
+ }
+
+ /*
+ * We really need to pin the page in memory; we are about to
+ * modify it.
+ */
+
+ /*
+ * Do a quick atomic lookup first - this is the fastpath.
+ */
+#warning FIXME: OPTIMIZE:
+ /*
+ * We only need to take this path when we don't know the page;
+ * we can hack vl_locate() so that if we know the key without
+ * checking the page and we find the vlocator, we can take the
+ * page from there.
+ */
+out_pin:
+ spin_lock (&mm->page_table_lock);
+ page = follow_page (mm, uaddr, 0);
+ if (likely (page != NULL)) {
+ key->shared.pgoff =
+ page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ if (!PageReserved (page))
+ get_page (page);
+ spin_unlock (&mm->page_table_lock);
+ goto out;
+ }
+ spin_unlock (&mm->page_table_lock);
+
+ /*
+ * Do it the general way.
+ */
+ result = get_user_pages (current, mm, uaddr, 1, 0, 0, &page, NULL);
+ if (result < 0)
+ goto error_up;
+ key->shared.pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+out:
+ get_key_refs (key);
+ *ppage = page;
+error_up:
+ up_read (&current->mm->mmap_sem);
+ return result;
+}
+
+
+/**
+ * Locate a vlocator for a certain user address.
+ *
+ * @ppage: Pointer to where to store the pointer to the referenced
+ * page where @uaddr is. NULL if unneeded.
+ * @pvl: Pointer to where to store the pointer to the located
+ * vlocator based structure.
+ * @ops: Pointer to the function ops for this guy (for checking)
+ * @uaddr: Address in user space.
+ * @returns: < 0 errno code on error (no resources held)
+ * 0 if ok and:
+ *
+ * - Located vlocator in *pvl, referenced
+ * - Page pointer in *ppage (unless NULL), referenced
+ * - The key in (*pvl)->key has been referenced with get_key_refs()
+ */
+int vl_find (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr)
+{
+ int result;
+ struct vlocator_queue *vlq;
+ struct vlocator *vl;
+ union futex_key key;
+ struct page *page;
+ unsigned need_page = 0;
+
+ ftrace ("(%p, %p, %p, %p)\n", ppage, pvl, ops, uaddr);
+
+ result = vl_setup (&page, &key, uaddr);
+ if (result < 0)
+ goto error;
+ /* Is it in the hash already? [do we know about it?] */
+ vlq = vl_hash (&key);
+ spin_lock (&vlq->lock);
+ vl = __vlocator_find (vlq, &key);
+ if (likely (vl == NULL))
+ goto out_unlock;
+ /* Check the type is correct */
+ result = -EFAULT;
+ if (unlikely (vl->ops != ops))
+ goto out_unlock;
+ result = 0;
+ vl_get (vl);
+ *pvl = vl;
+ if (ppage) {
+ *ppage = page;
+ need_page = 1;
+ }
+out_unlock:
+ spin_unlock (&vlq->lock);
+ drop_key_refs (&key); // Undo from ...
+ if (need_page == 0)
+ put_page (page); // ... vl_setup() (maybe)
+error:
+ return result;
+}
+
+
+/**
+ * Locate or allocate a vlocator for a certain user address.
+ *
+ * @ppage: Pointer to where to store the pointer to the referenced
+ * page where @uaddr is. NULL if unneeded.
+ * @pvl: Pointer to where to store the pointer to the allocated
+ * vlocator based structure.
+ * @ops: Pointer to the function ops used for allocating/constructing, etc.
+ * @uaddr: Address in user space.
+ * @flags: flags, for initialization, passed to ops->create()
+ * @returns: < 0 errno code on error (no resources held)
+ * 0 if ok and:
+ *
+ * - Located or allocated vlocator in *pvl, referenced
+ * - Page pointer in *ppage (unless NULL), referenced
+ * - The key in (*pvl)->key has been referenced with get_key_refs()
+ *
+ * FIXME: optimize allocation collision detection; add a modification
+ * count to the hash table; whenever anything is added, inc the
+ * count. If we see the count changed after we dropped the lock,
+ * allocated and re-locked, then we need to find_r(), if not, we can
+ * assume nothing changed.
+ */
+int vl_locate (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr, unsigned flags)
+{
+ int result;
+ struct vlocator_queue *vlq;
+ struct vlocator *vl, *vl_alt;
+ union futex_key key;
+ struct page *page;
+ unsigned need_page = 0;
+ unsigned additions0;
+
+ ftrace ("(%p, %p, %p, %p, %x)\n", ppage, pvl, ops, uaddr, flags);
+
+ result = vl_setup (&page, &key, uaddr);
+ if (result < 0)
+ goto error;
+ /* Is it in the hash already? [do we know about it?] */
+ vlq = vl_hash (&key);
+ spin_lock (&vlq->lock);
+ vl = __vlocator_find (vlq, &key);
+ if (likely (vl != NULL))
+ goto out_check;
+ additions0 = vlq->additions;
+ spin_unlock (&vlq->lock);
+
+ /* Naaah, let's alloc it */
+ result = -ENOMEM;
+ vl = ops->alloc();
+ if (unlikely (vl == NULL))
+ goto error_alloc;
+ result = 0;
+
+ /* Ok, allocated - did somebody add it while we were allocating?
+ * [we try to speed up by checking if anybody has added since
+ * we dropped the lock--we live with the chance of 4G
+ * allocations wrapping the counter in the middle *grin*]. */
+ spin_lock (&vlq->lock);
+ if (additions0 == vlq->additions
+ || (vl_alt = __vlocator_find_r (vlq, &key)) == NULL) {
+ ops->create (vl, page, &key, flags);
+ vl->ops = ops;
+ __vlocator_add (vlq, vl);
+ goto out_ref;
+ }
+ /* Allocation collision, get the new one, discard ours */
+ ops->free (vl);
+ vl = vl_alt;
+out_check:
+ result = -EFAULT;
+ if (unlikely (vl->ops != ops)) /* Check the type is correct */
+ goto out_unlock;
+ result = 0;
+out_ref:
+ vl_get (vl);
+ *pvl = vl;
+ if (ppage) {
+ *ppage = page;
+ need_page = 1;
+ }
+out_unlock:
+ spin_unlock (&vlq->lock);
+error_alloc:
+ drop_key_refs (&key); // Undo from ...
+ if (need_page == 0)
+ put_page (page); // ... vl_setup() (maybe)
+error:
+ return result;
+}
+
+
+/**
+ * Dispose of a vlocator
+ *
+ * When a vlocator is no longer needed, remove it from the hash table
+ * (add it to a delete list so the garbage collector can properly get
+ * rid of it).
+ */
+int vl_dispose (const struct vlocator_ops *vops,
+ volatile const unsigned __user *uaddr)
+{
+ int result = -ENXIO;
+ struct vlocator_queue *vlq;
+ struct vlocator *vl;
+ struct page *page;
+ union futex_key key;
+
+ result = vl_setup (&page, &key, uaddr);
+ if (result < 0)
+ goto error;
+ vlq = vl_hash (&key);
+ spin_lock (&vlq->lock);
+ vl = __vlocator_find (vlq, &key);
+ if (vl == NULL)
+ goto out_unlock;
+ /* In the hash, move it out */
+ result = 0;
+ list_del (&vl->hash_list);
+ spin_lock (&__vl_disposal_lock);
+ list_add_tail (&vl->hash_list, &__vl_disposal);
+ spin_unlock (&__vl_disposal_lock);
+out_unlock:
+ spin_unlock (&vlq->lock);
+ put_page (page);
+ drop_key_refs (&key);
+error:
+ return result;
+}
+
+
+/** Work structure for the garbage collector */
+static void vl_garbage_collector (void *);
+DECLARE_WORK(vl_workqueue, vl_garbage_collector, NULL);
+
+
+/**
+ * Do garbage collection (called from the work-queue) and re-arm
+ *
+ * The most important action is to clean up the ownership of [u]fulocks
+ * that were assigned to init because the owner died and robustness
+ * was not enabled. This means that only ufulocks get to optionally
+ * support robustness; kernel-based fulocks have to do robustness
+ * always.
+ */
+
+static inline
+void __vl_garbage_collect (struct list_head *list)
+{
+ struct list_head *itr, *nxt;
+ struct vlocator *vl;
+ list_for_each_safe (itr, nxt, list) {
+ vl = container_of (itr, struct vlocator, hash_list);
+ if (__vlocator_rem (vl)) {
+ vl->ops->release (vl);
+ vl->ops->free (vl);
+ }
+ }
+}
+
+static
+void vl_garbage_collector (void *dummy)
+{
+ unsigned cnt;
+ struct vlocator_queue *vlq;
+
+ /* Clean up the vlocator hash: anything with a zero refcount is wiped */
+ for (cnt = 0; cnt < VL_HASH_QUEUES; cnt++) {
+ vlq = &__vl_hash[cnt];
+ if (list_empty (&vlq->queue)) /* Some cheating always helps... */
+ continue;
+ spin_lock (&vlq->lock);
+ __vl_garbage_collect (&vlq->queue);
+ spin_unlock (&vlq->lock);
+ }
+ /* Clean up the list of stuff marked for disposal */
+ spin_lock (&__vl_disposal_lock);
+ __vl_garbage_collect (&__vl_disposal);
+ spin_unlock (&__vl_disposal_lock);
+ /* Re-arm */
+ schedule_delayed_work (&vl_workqueue, VL_GC_PERIOD * HZ);
+}
+
+
+/** Initialize the ufu subsystem. */
+static
+int __init subsys_ufu_init (void)
+{
+ unsigned i;
+ for (i = 0; i < sizeof (__vl_hash) / sizeof (__vl_hash[0]); i++)
+ vlocator_queue_init (&__vl_hash[i]);
+ /* Set up the garbage collector to run every VL_GC_PERIOD seconds */
+ schedule_delayed_work (&vl_workqueue, VL_GC_PERIOD * HZ);
+ return 0;
+}
+__initcall (subsys_ufu_init);
+
+

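For reference, a minimal sketch of how a client subsystem ties into this
layer, modelled on what ufuqueue.c in patch 9/11 does; 'struct my_object',
'my_vops' and the function below are invented for illustration.

struct my_object {
        struct vlocator vlocator;
        /* ... per-object state keyed by the user-space address ... */
};

static int my_object_use (volatile unsigned __user *uaddr,
                          const struct vlocator_ops *my_vops)
{
        struct vlocator *vl;
        struct page *page;
        struct my_object *obj;
        int result;

        /* Pin the page, build the futex key, find-or-allocate the object;
         * on success both *vl and *page come back referenced. */
        result = vl_locate (&page, &vl, my_vops, uaddr, 0);
        if (result < 0)
                return result;
        obj = container_of (vl, struct my_object, vlocator);

        /* ... operate on obj ... */

        vl_put (vl);            /* garbage collector frees it at refcount 0 */
        put_page (page);
        return 0;
}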
2004-04-21 04:15:14

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 10/11: kernel fulocks

include/linux/fulock-olist.h | 55 +++
include/linux/fulock.h | 278 +++++++++++++++++++
kernel/fulock.c | 605 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 938 insertions(+)

--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/linux/fulock-olist.h Wed Jan 28 23:25:36 2004
@@ -0,0 +1,55 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Fulock Ownership List
+ *
+ * This file is here to avoid include hell, as sched.h needs it to be
+ * able to embed a fulock ownership list into 'struct
+ * task_struct'. Its sole users are sched.h and fulock.h.
+ *
+ * This is just a redirection of the methods used to manage the list
+ * of fulocks that a task owns in a given moment.
+ *
+ * I don't think anybody in their right mind wants to use a
+ * priority-array list for this, but just in case, and to ease
+ * changing the implementation, I am redirecting it.
+ *
+ * I will pull the plug on this one, search and replace all the
+ * olist_*() with plist_*() and wipe this file out as soon as I have a
+ * minute. I consider this low prio.
+ */
+
+#ifndef __linux_fulock_olist_h__
+#define __linux_fulock_olist_h__
+
+#ifdef __KERNEL__
+
+#include <linux/plist.h>
+
+typedef struct plist olist_t;
+
+#define olist_add plist_add
+#define olist_rem plist_rem
+#define olist_init plist_init
+#define olist_INIT(plist) plist_INIT(plist)
+#define olist_for_each plist_for_each
+#define olist_for_each_safe plist_for_each_safe
+#define olist_empty plist_empty
+#define olist_first plist_first
+#define olist_prio plist_prio
+#define olist_chprio plist_chprio
+#define __olist_chprio __plist_chprio
+#if 0
+#define olist_update_prio plist_update_prio
+#endif
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_fulock_olist_h__ */
--- /dev/null Thu Apr 15 00:58:25 2004
+++ include/linux/fulock.h Thu Apr 8 00:14:02 2004
@@ -0,0 +1,278 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __linux_fulock_h__
+#define __linux_fulock_h__
+
+#include <asm/fulock.h>
+
+/**
+ * User space provided fulock flags
+ *
+ * fulocks/ufulocks are *always* robust; it is up to the caller to
+ * emulate a hang if a robust behavior is not desired. However, the
+ * NOT_RM (not robust mutex) flag is kept and checked on exit and if
+ * there are owned fulocks, a warning will be printed.
+ */
+enum {
+ FULOCK_FL_USER_MK = 0xfffff000, /* Flags provided by user space */
+ FULOCK_FL_PI = 0x80000000, /* Priority inherit */
+ FULOCK_FL_PP = 0x40000000, /* Priority protected */
+ FULOCK_FL_RM_SUN = 0x20000000, /* Robust mutex (sun style) */
+ FULOCK_FL_RM = 0x10000000, /* Robust mutex [warns only] */
+ FULOCK_FL_KCO = 0x08000000, /* No fast path, KCO mode always */
+ FULOCK_FL_ERROR_CHK = 0x04000000, /* Perform POSIX error checks */
+/* Priority ceiling masks */
+ FULOCK_FL_PP_PLC_MK = 0x00f00000, /* Policy */
+ FULOCK_FL_PP_PRIO_MK = 0x000ff000 /* Priority */
+};
+
+/** Fulock health state */
+enum fulock_st {
+ fulock_st_healthy, /* Normal, healthy */
+ fulock_st_dead, /* Some previous owner died */
+ fulock_st_nr, /* idem, plus cannot be recovered */
+ fulock_st_init /* fulock was re-initialized */
+};
+
+
+/** Fulock control actions */
+enum fulock_ctl {
+ fulock_ctl_nop = 0, /* No-op, just return current health state */
+ fulock_ctl_init, /* Move from any state to initialized */
+ fulock_ctl_heal, /* Move from dead to healthy */
+ fulock_ctl_nr, /* Move from dead to not-recoverable */
+ fulock_ctl_destroy, /* Tell the kernel to forget about it */
+ fulock_ctl_waiters, /* Do we have waiters? */
+ fulock_ctl_locked, /* Is it locked? */
+ fulock_ctl_setprioceil /* Set the priority ceiling */
+};
+
+
+#ifdef __KERNEL__
+
+#include <linux/fuqueue.h>
+#include <linux/plist.h>
+#include <linux/vlocator.h>
+#include <asm/errno.h>
+
+/* Internal fulock flags */
+enum {
+ FULOCK_FL_NR = 0x00000100, /* not recoverable */
+ FULOCK_FL_DEAD = 0x00000200, /* dead-owner */
+ FULOCK_FL_NEEDS_SYNC = 0x00000800, /* Need sync from user space */
+ FULOCK_FL_PP_MK = 0x000000ff /* raw priority ceiling */
+};
+
+struct fulock;
+
+/** Operations on a fulock. */
+struct fulock_ops
+{
+ struct fuqueue_ops fuqueue;
+ unsigned (* owner_set) (struct fulock *, struct task_struct *);
+ unsigned (* owner_reset) (struct fulock *);
+ void (* exit) (struct fulock *);
+};
+
+/* In-kernel fulock operations */
+extern struct fulock_ops fulock_ops;
+extern struct fulock_ops ufulock_ops;
+extern struct vlocator_ops ufulock_vops;
+
+
+/** A fulock, mutex usable from the kernel. */
+struct fulock {
+ struct fuqueue fuqueue;
+ struct task_struct *owner;
+ unsigned flags;
+ olist_t olist_node;
+};
+
+
+/** Initialize a @fulock with given ops */
+static __inline__
+void __fulock_init (struct fulock *fulock, struct fulock_ops *ops)
+{
+ __fuqueue_init (&fulock->fuqueue, &ops->fuqueue);
+ fulock->owner = NULL;
+ fulock->flags = 0;
+ olist_init (&fulock->olist_node);
+}
+
+/** Statically initialize a @fulock with given ops */
+#define __fulock_INIT(fulock, ops) { \
+ .fuqueue = __fuqueue_INIT (&(fulock)->fuqueue, \
+ &(ops)->fuqueue), \
+ .owner = NULL, \
+ .flags = 0, \
+ .olist_node = olist_INIT (&(fulock)->olist_node) \
+}
+
+/** Initialize a @fulock for usage within the kernel */
+static __inline__
+void fulock_init (struct fulock *fulock)
+{
+ __fulock_init (fulock, &fulock_ops);
+}
+
+/** Statically initialize a @fulock for usage within the kernel */
+#define fulock_INIT(fulock) __fulock_INIT (fulock, &fulock_ops)
+
+
+/* Primitives for locking/unlocking/waiting/signalling */
+extern int __fulock_lock (struct fulock *, signed long);
+extern int __fulock_unlock (struct fulock *, size_t, int);
+extern int __fulock_ctl (struct fulock *fulock, enum fulock_ctl ctl);
+extern void __fulock_op_exit (struct fulock *);
+extern void exit_fulocks (struct task_struct *task);
+
+extern int __fulock_check_deadlock (struct task_struct *, struct fulock *);
+
+/** More internal stuff for building on top of fulock */
+extern unsigned __fulock_op_waiter_cancel (struct fuqueue *,
+ struct fuqueue_waiter *);
+extern struct task_struct * __fulock_op_waiter_chprio (
+ struct task_struct *, struct fuqueue *, struct fuqueue_waiter *);
+
+
+/**
+ * Lock a fulock, maybe wait for it to be available
+ *
+ * @timeout: wait up to @timeout jiffies to acquire the fulock. If
+ * zero, don't block (trylock only). If MAX_SCHEDULE_TIMEOUT,
+ * block indefinitely until the lock is acquired.
+ *
+ * @returns: See __fulock_lock().
+ *
+ * Can ONLY be called from process context. Note __fulock_lock()
+ * releases the fuqueue lock on return and re-enables IRQs and
+ * preemption.
+ */
+static __inline__
+int fulock_lock (struct fulock *fulock, signed timeout)
+{
+ int result;
+ do {
+ spin_lock_irq (&fulock->fuqueue.lock);
+ result = __fulock_lock (fulock, timeout);
+ } while (result == -EAGAIN);
+ return result;
+}
+
+
+/**
+ * Unlock a fulock, wake up waiter(s)
+ *
+ * @f: fulock.
+ * @howmany: Wake up this many waiters; if 0, wake up only one,
+ * forcing a serialization in the acquisition of the
+ * futex, so that no other task (in user or kernel space)
+ * can acquire it.
+ * @returns: Number of tasks woken up, < 0 errno code on error.
+ *
+ * Can be called from any context [I hope].
+ */
+static __inline__
+int fulock_unlock (struct fulock *fulock, size_t howmany)
+{
+ int result;
+ unsigned long flags;
+
+ ftrace ("(%p, %zu)\n", fulock, howmany);
+
+ spin_lock_irqsave (&fulock->fuqueue.lock, flags);
+ result = __fulock_unlock (fulock, howmany, 0);
+ spin_unlock_irqrestore (&fulock->fuqueue.lock, flags);
+ return result;
+}
+
+
+/**
+ * Manipulate the state of a fulock
+ *
+ * @f: fulock to set
+ * @ctl: Control command to modify the fulock's state.
+ * @returns: 'enum fulock_st' health state; < 0 errno code on
+ * error.
+ *
+ * FIXME: this function set is kind of too convoluted, I am afraid.
+ *
+ * Can only be called from process context, as it checks whether
+ * current is the owner.
+ */
+static __inline__
+int fulock_ctl (struct fulock *fulock, enum fulock_ctl ctl)
+{
+ int result;
+ unsigned long flags;
+ ftrace ("(%p, %d)\n", fulock, ctl);
+
+ spin_lock_irqsave (&fulock->fuqueue.lock, flags);
+ result = __fulock_ctl (fulock, ctl);
+ spin_unlock_irqrestore (&fulock->fuqueue.lock, flags);
+ return result;
+}
+
+
+/** A ufulock, tied to a user-space vm address. */
+struct ufulock {
+ struct fulock fulock;
+ struct vlocator vlocator;
+ struct page *page;
+};
+
+
+/** @Return true if the fulock @f has no waiters. */
+static __inline__
+unsigned __fulock_empty (const struct fulock *f)
+{
+ return __fuqueue_empty (&f->fuqueue);
+}
+
+
+/** [Internal] Make task @task the owner of fulock @f. */
+static __inline__
+unsigned __fulock_op_owner_set (struct fulock *fulock,
+ struct task_struct *task)
+{
+ unsigned prio_changed;
+
+ ftrace ("(%p, %p [%d])\n", fulock, task, task->pid);
+ CHECK_IRQs();
+
+ fulock->owner = task;
+ _raw_spin_lock (&task->fulock_olist_lock);
+ prio_changed = olist_add (&task->fulock_olist, &fulock->olist_node);
+ _raw_spin_unlock (&task->fulock_olist_lock);
+ return prio_changed;
+}
+
+
+/** [Internal] Reset ownership of fulock @f. */
+static __inline__
+unsigned __fulock_op_owner_reset (struct fulock *fulock)
+{
+ unsigned prio_changed;
+ struct task_struct *owner = fulock->owner;
+ ftrace ("(%p)\n", fulock);
+ CHECK_IRQs();
+
+ _raw_spin_lock (&owner->fulock_olist_lock);
+ prio_changed = olist_rem (&owner->fulock_olist, &fulock->olist_node);
+ _raw_spin_unlock (&owner->fulock_olist_lock);
+ fulock->owner = NULL;
+ return prio_changed;
+}
+
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_fulock_h__ */
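
For reference, a minimal in-kernel usage sketch of the API above (not
part of the patch; 'demo_lock' and 'demo_work()' are made up for
illustration). A serialized unlock (howmany == 0) hands ownership
straight to the first waiter; -EOWNERDEAD means we got the lock but a
previous owner died holding it:

        static struct fulock demo_lock;    /* fulock_init()ed at setup time */

        static void demo_work (void)
        {
                int result = fulock_lock (&demo_lock, MAX_SCHEDULE_TIMEOUT);
                if (result == -EOWNERDEAD) {
                        /* recover the protected data, then mark it healthy */
                        fulock_ctl (&demo_lock, fulock_ctl_heal);
                }
                else if (result < 0)
                        return;    /* -ENOTRECOVERABLE, etc. */
                /* ... critical section ... */
                fulock_unlock (&demo_lock, 0);    /* serialized unlock */
        }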
--- /dev/null Thu Apr 15 00:58:25 2004
+++ kernel/fulock.c Wed Apr 14 01:48:24 2004
@@ -0,0 +1,605 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#include <linux/fulock.h>
+#include <linux/plist.h>
+#include <linux/time.h> /* struct timespec */
+#include <linux/sched.h> /* MAX_SCHEDULE_TIMEOUT */
+#include <linux/errno.h>
+
+#warning FIXME: make the fulock in-kernel interface more clear
+#warning TODO: FIFO mode for the waiters list
+
+
+/**
+ * A waiter on this fulock was added, removed or reprioritized and
+ * that caused the wlist priority to change, so we must update the
+ * fulock priority in the owner's ownership list.
+ *
+ * @fulock: The defendant [nossir, I haven't been groklawing].
+ * @returns: !0 if the boost priority for the owner was changed.
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled.
+ */
+static
+unsigned __fulock_prio_update (struct fulock *fulock)
+{
+ unsigned prio_changed;
+ struct task_struct *owner;
+ int prio;
+ unsigned result = 0;
+
+ ftrace ("(%p)\n", fulock);
+
+ owner = fulock->owner;
+ prio = wlist_prio (&fulock->fuqueue.wlist);
+ if (owner == NULL) {
+ prio_changed = olist_prio (&fulock->olist_node) != prio;
+ __olist_chprio (&fulock->olist_node, prio);
+ goto out;
+ }
+ /* reposition this fulock on it's owner's ownership list */
+ _raw_spin_lock (&owner->fulock_olist_lock);
+ prio_changed = olist_chprio (&owner->fulock_olist,
+ &fulock->olist_node, prio);
+ _raw_spin_unlock (&owner->fulock_olist_lock);
+ /* there is a new maximum on the list? maybe chprio the owner */
+ ldebug (2, "prio boosting %d to prio %d (current %d)\n",
+ owner->pid, prio, owner->prio);
+ if (prio_changed)
+ result = __prio_boost (owner, prio);
+out:
+ ldebug (2, "fulock %p, updated prio to %d (changed %d), "
+ "propagating to owner, pid %d (changed %d)\n",
+ fulock, prio, prio_changed, owner? owner->pid : -1, result);
+ return result;
+}
+
+
+
+
+/**
+ * Test if a to-be queued task would deadlock if it waits for a fulock.
+ *
+ * Simple as it is, it looks ugly as hell. Basically, the fulock we
+ * are about to lock will have an owner, so we check that owner: if it
+ * is us, deadlock; if not, we see if the owner is itself waiting for
+ * anything; if so, we check that it is a fulock and, if so, who its
+ * owner is; if it is us, then deadlock; if not, start again ...
+ *
+ * Now, the trick is to keep the reference counts, and the locks and
+ * all the crap. A single lock for everything would be *so* beautiful
+ * (and not scalable :).
+ *
+ * @task: the task to check
+ * @fulock: the fulock to check
+ * @returns: 0 if ok, < 0 errno on error.
+ *
+ * This *should* be safe being lock-less [famous last words].
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled.
+ */
+int __fulock_check_deadlock (struct task_struct *task, struct fulock *fulock)
+{
+ int result = 0;
+ struct fuqueue_ops *ops;
+ struct task_struct *owner;
+ struct fuqueue *fuqueue;
+
+ ftrace ("(%p, %p)\n", task, fulock);
+
+ /* first fulock to trace is already locked and we won't unlock it */
+ owner = fulock->owner;
+ if (owner == NULL)
+ goto out;
+ result = -EDEADLK;
+ if (owner == task)
+ goto out;
+ get_task_struct (owner);
+next:
+ result = 0;
+ /* Who is the owner waiting for? safely acquire and lock it */
+ _raw_spin_lock (&owner->fuqueue_wait_lock);
+ fuqueue = owner->fuqueue_wait;
+ if (fuqueue == NULL) /* Ok, not waiting */
+ goto out_owner_unlock;
+ if (!_raw_spin_trylock (&fuqueue->lock)) { /* Spin dance... */
+ _raw_spin_unlock (&owner->fuqueue_wait_lock);
+ goto next;
+ }
+ ops = fuqueue->ops;
+ if (ops->get)
+ ops->get (fuqueue);
+ _raw_spin_unlock (&owner->fuqueue_wait_lock);
+ put_task_struct (owner);
+
+ /* Is a fulock whatever the owner is waiting for? */
+ if (ops != &fulock_ops.fuqueue && ops != &ufulock_ops.fuqueue)
+ goto out_fuqueue_unlock;
+ fulock = container_of (fuqueue, struct fulock, fuqueue);
+ owner = fulock->owner; /* Who's the fulock's owner? */
+ if (unlikely (owner == NULL)) /* Released before we locked it? */
+ goto out_fuqueue_unlock;
+ result = -EDEADLK; /* Is it us? oops */
+ if (owner == task)
+ goto out_fuqueue_unlock;
+
+ /* What's the owner waiting for? Proceed to it */
+ get_task_struct (owner);
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+ if (ops->put)
+ ops->put (fuqueue);
+ goto next;
+
+out_owner_unlock:
+ _raw_spin_unlock (&owner->fuqueue_wait_lock);
+ put_task_struct (owner);
+ return result;
+
+out_fuqueue_unlock:
+ _raw_spin_unlock (&fuqueue->lock); /* 'fulock' may still point to the previous one */
+ if (ops->put)
+ ops->put (fuqueue);
+out:
+ return result;
+}
+
+
+/**
+ * Check if a process is allowed to lock a PP fulock.
+ *
+ * @fulock: fulock to check on.
+ * @returns: 0 if ok, errno code on error (disallowed)
+ */
+int __fulock_pp_allowed (struct fulock *fulock)
+{
+#warning FIXME: finish _pp_allowed()
+#if 0
+ int policy = (fulock->flags & FULOCK_FL_PP_PLC_MK) >> 20;
+ int priority = (fulock->flags & FULOCK_FL_PP_PRIO_MK) >> 12;
+ int prio;
+
+ if (policy != SCHED_NORMAL)
+ prio = MAX_USER_RT_PRIO - 1 - priority;
+ else
+ prio = priority;
+ fulock->flags &= ~FULOCK_FL_PP_PRIO_MK;
+ fulock->flags |= prio & FULOCK_FL_PP_PRIO_MK;
+#warning FIXME: interaction with PI? Compare against static, not dynamic?
+ if (prio > current->prio)
+ return -EINVAL;
+#endif
+ return 0;
+}
+
+
+/**
+ * [Try]lock a fulock
+ *
+ * @f: fulock to [try]lock.
+ * @timeout: Time to wait for the lock to be acquired. If 0, trylock
+ * only, if MAX_SCHEDULE_TIMEOUT, wait for ever, else, wait
+ * the given time.
+ * @returns: 0 Acquired the fulock
+ * -EOWNERDEAD Acquired the fulock, some previous owner died.
+ * -ENOTRECOVERABLE Not acquired, fulock is not recoverable
+ * -EBUSY Not acquired, it is locked [and trylock was
+ * requested].
+ * -EAGAIN Not acquired, try again [FIXME: change this,
+ * POSIX uses it for mutexes].
+ * * Not acquired, error.
+ *
+ * Needs f->lock held; on return, it will have been released.
+ *
+ * Note that some operations involving flag reading don't need locking
+ * because to change those values the calling task needs to be the
+ * owner of the fulock, and once we come out of the wait without
+ * errors, we are the owner.
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled [the
+ * fuqueue lock acquired with spin_lock_irq()].
+ */
+int __fulock_lock (struct fulock *fulock, signed long timeout)
+{
+ int prio_changed;
+ int result = 0;
+ int prio = BOTTOM_PRIO;
+ struct fulock_ops *ops =
+ container_of (fulock->fuqueue.ops, struct fulock_ops, fuqueue);
+ struct fuqueue_waiter w;
+
+ ftrace ("(%p [flags 0x%x], %ld)\n", fulock, fulock->flags, timeout);
+
+ result = -ENOTRECOVERABLE;
+ if (fulock->flags & FULOCK_FL_NR) /* can we lock? */
+ goto out_unlock;
+ if (fulock->flags & FULOCK_FL_ERROR_CHK) {
+ result = __fulock_check_deadlock (current, fulock);
+ if (result < 0)
+ goto out_unlock;
+ }
+ if (fulock->flags & FULOCK_FL_PP) {
+ result = __fulock_pp_allowed (fulock);
+ if (result)
+ goto out_unlock;
+ }
+ if (fulock->owner == NULL) /* Unlocked? take it */
+ goto its_unlocked;
+ /* Nchts, we have to wait. */
+ result = -EBUSY;
+ if (!timeout)
+ goto out_unlock;
+ prio_changed = __fuqueue_waiter_queue (&fulock->fuqueue, &w);
+ if (prio_changed
+ && fulock->flags & FULOCK_FL_PI
+ && __fulock_prio_update (fulock))
+ __fuqueue_waiter_chprio (fulock->owner);
+ return __fuqueue_waiter_block (&fulock->fuqueue, &w, timeout);
+
+its_unlocked:
+ result = fulock->flags & FULOCK_FL_DEAD? -EOWNERDEAD : 0;
+ prio_changed = ops->owner_set (fulock, current);
+ if (prio_changed) {
+ /* The ownership list's prio changed, update boost */
+ prio = olist_prio (&fulock->olist_node);
+ __prio_boost (current, prio);
+ }
+out_unlock:
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+ local_irq_enable();
+ preempt_enable();
+ return result;
+}
+
+
+/**
+ * Unlock a fulock, wake up waiter(s) [internal version]
+ *
+ * @fulock: Address for the fulock (kernel space).
+ * @howmany: Wake up this many waiters; if 0, wake up only one,
+ * forcing a serialization in the acquisition of the
+ * futex, so that no other task (in user or kernel space)
+ * can acquire it.
+ * @code: If waking up in parallel mode, return code to be passed to
+ * the waiters as a result of their wait.
+ * @returns: 0 if the fulock was unlocked/there are no waiters left or
+ * howmany > 0
+ * 1 fulock ownership transferred to 1st waiter, there was
+ * one waiter (now none)
+ * 2 fulock ownership transferred to 1st waiter, there was
+ * more than one waiter
+ *
+ * ufulock_unlock() uses this to update the vfulock, except
+ * if howmany > 0.
+ *
+ * Requires fulock->fuqueue.lock held, IRQs & preempt disabled!!!
+ */
+#warning TODO: automatic selection of unlock mode [as suggested by Jamie]
+int __fulock_unlock (struct fulock *fulock, size_t howmany, int code)
+{
+ int result = 0;
+ int old_prio_changed, new_prio_changed;
+ struct fulock_ops *ops =
+ container_of (fulock->fuqueue.ops, struct fulock_ops, fuqueue);
+ struct task_struct *old_owner, *new_owner;
+ struct fuqueue_waiter *w = NULL;
+
+ ftrace ("(%p, %zu, %d)\n", fulock, howmany, code);
+
+ if (fulock->owner == NULL) /* Unlocked? */
+ goto out;
+ old_owner = fulock->owner;
+ old_prio_changed = ops->owner_reset (fulock);
+ if (__fulock_empty (fulock))
+ goto out_unboost;
+ if (howmany > 0) { /* Parallel unlock */
+ code = code == 0? -EAGAIN : code;
+ while (howmany-- && !__fuqueue_empty (&fulock->fuqueue)) {
+ w = __fuqueue_first (&fulock->fuqueue);
+ __fuqueue_waiter_unqueue (w, code);
+ wake_up_process (w->task);
+ }
+ if (likely (w != NULL))
+ wlist_update_prio (&fulock->fuqueue.wlist);
+ }
+ else { /* Serialized unlock */
+ w = __fuqueue_first (&fulock->fuqueue);
+ code = fulock->flags & FULOCK_FL_DEAD? -EOWNERDEAD : 0;
+ __fuqueue_waiter_unqueue (w, code);
+ /* Do PI: Update the wlist's top priority (in case it
+ * changed). If it did, then change the prio of the
+ * fulock (in the olist_node); will chprio the task later */
+ new_prio_changed = wlist_update_prio (&fulock->fuqueue.wlist);
+ if (new_prio_changed && fulock->flags & FULOCK_FL_PI)
+ __olist_chprio (&fulock->olist_node,
+ wlist_prio (&fulock->fuqueue.wlist));
+ /* Now register owner, queue the fulock in w->task's
+ * ownership list; if prio of olist changes, boost */
+ new_owner = w->task;
+ ldebug (2, "olist prio before upgrade %d\n",
+ olist_prio (&new_owner->fulock_olist));
+ new_prio_changed = ops->owner_set (fulock, new_owner);
+ ldebug (2, "olist prio after upgrade %d, new_prio_changed %d\n",
+ olist_prio (&new_owner->fulock_olist), new_prio_changed);
+ if (new_prio_changed) {
+ int prio = olist_prio (&new_owner->fulock_olist);
+ ldebug (2, "__prio_boost pid %d to %d\n", new_owner->pid, prio);
+ __prio_boost (new_owner, prio);
+ }
+ wake_up_process (new_owner);
+ result = __fulock_empty (fulock)? 1 : 2;
+ }
+ /* Now we can unboost (if we have to). prio_changed can be
+ * true only if PI or PP (and the prio actually changed) */
+out_unboost:
+ ldebug (2, "old_prio_changed %d, new boost prio %d\n",
+ old_prio_changed, wlist_prio (&fulock->fuqueue.wlist));
+ if (old_prio_changed
+ && __prio_boost (old_owner, wlist_prio (&fulock->fuqueue.wlist))) {
+ ldebug (2, "here\n");
+ __fuqueue_waiter_chprio (old_owner);
+ }
+out:
+ return result;
+}
+
+
+/**
+ * Modify/query the internal state of a fulock.
+ *
+ * @fulock: fulock to modify.
+ * @ctl: control command to issue.
+ * @returns: health state
+ */
+int __fulock_ctl (struct fulock *fulock, enum fulock_ctl ctl)
+{
+ int result;
+ enum fulock_st state;
+
+ ftrace ("(%p, %d)\n", fulock, ctl);
+
+ /* Gather current state */
+ if (fulock->flags & FULOCK_FL_NR)
+ state = fulock_st_nr;
+ else if (fulock->flags & FULOCK_FL_DEAD)
+ state = fulock_st_dead;
+ else
+ state = fulock_st_healthy;
+
+ /* Now, what to do? */
+ switch (ctl) {
+ /* Nothing, so just say how are we standing */
+ case fulock_ctl_nop:
+ result = state;
+ break;
+
+ /* Reinitialize it */
+ case fulock_ctl_init:
+ fulock->flags &= ~(FULOCK_FL_DEAD | FULOCK_FL_NR);
+ __fulock_unlock (fulock, (size_t) ~0, -ENOENT);
+ result = fulock_st_init;
+ break;
+
+ /* Mark it healthy */
+ case fulock_ctl_heal:
+ result = -EINVAL;
+ if (state != fulock_st_dead)
+ break;
+ result = -EPERM;
+ if (fulock->owner != current) /* Who are you? */
+ break;
+ fulock->flags &= ~FULOCK_FL_DEAD;
+ result = fulock_st_healthy;
+ break;
+
+ /* Make it not recoverable; wake up every waiter with error;
+ * unlock. */
+ case fulock_ctl_nr:
+ result = -EINVAL;
+ if (state != fulock_st_dead)
+ break;
+ result = -EPERM;
+ if (fulock->owner != current) /* Who are you? */
+ break;
+ result = __fulock_unlock (fulock, (size_t)~0, -ENOTRECOVERABLE);
+ if (result >= 0) {
+ fulock->flags &= ~FULOCK_FL_DEAD;
+ fulock->flags |= FULOCK_FL_NR;
+ result = fulock_st_nr;
+ }
+ break;
+
+ /* Set the prio ceiling */
+ case fulock_ctl_setprioceil:
+#warning FIXME: this is most probably WRONG
+#warning FIXME: Finish me, set prio ceil
+#if 0
+ result = -EINVAL;
+ if (!(fulock->flags & FULOCK_FL_PP))
+ break;
+ result = -EPERM;
+ if (fulock->owner != current) /* Who are you? */
+ break;
+ _raw_spin_lock (&owner->fulock_olist_lock);
+ prio = fulock->flags & FULOCK_FL_PP_MK;
+ prio_changed = olist_chprio (&current->fulock_olist,
+ &fulock->olist_node, prio);
+ _raw_spin_unlock (&owner->fulock_olist_lock);
+ if (prio_changed)
+ __prio_boost (owner, prio);
+#endif
+ result = 0;
+ break;
+
+ default:
+ result = -EINVAL;
+ }
+ return result;
+}
+
+
+/**
+ * Set the priority of a fulock waiter.
+ *
+ * @task: task to re-prioritize
+ * @fuqueue: fuqueue of the fulock the task is waiting for [locked]
+ * @w: waiter @task is waiting on in @fuqueue.
+ * @returns: NULL (if there is no need to propagate), task where
+ * propagation would need to continue.
+ *
+ * This does not set the prio of the process itself!
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled.
+ */
+struct task_struct * __fulock_op_waiter_chprio (
+ struct task_struct *task, struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ struct fuqueue_ops *ops;
+ struct fulock *fulock;
+ int prio_changed;
+
+ ftrace ("(%p [%d], %p, %p)\n", task, task->pid, fuqueue, w);
+
+ /* Verify this is really a fulock */
+ ops = fuqueue->ops;
+ BUG_ON (ops != &fulock_ops.fuqueue && ops != &ufulock_ops.fuqueue);
+ fulock = container_of (fuqueue, struct fulock, fuqueue);
+ /* Update the waiter's position in the fulock's wlist */
+ prio_changed = wlist_chprio (&fuqueue->wlist, &w->wlist_node,
+ task->prio);
+ /* The prio change of the waiter caused the wlist prio to
+ * change; if we are PI, we change the prio of the fulock AND
+ * if that changed the prio of the owner, return it. */
+ if (prio_changed
+ && fulock->flags & FULOCK_FL_PI
+ && __fulock_prio_update (fulock))
+ return fulock->owner;
+ return NULL;
+}
+
+
+/**
+ * Initialize fulock specific stuff for a task
+ *
+ */
+void init_fulock (struct task_struct *task)
+{
+ __ftrace (0, "(task %p)\n", task);
+
+ spin_lock_init (&task->fuqueue_wait_lock);
+ task->fuqueue_wait = NULL;
+ task->fuqueue_waiter = NULL;
+ spin_lock_init (&task->fulock_olist_lock);
+ olist_init (&task->fulock_olist);
+}
+
+
+/** Release as dead a @fulock because the owner is exiting. */
+void __fulock_op_exit (struct fulock *fulock)
+{
+ ftrace ("(%p)\n", fulock);
+
+ fulock->flags |= FULOCK_FL_DEAD;
+ if (!(fulock->flags & FULOCK_FL_RM))
+ printk (KERN_WARNING "Task %d [%s] exited holding non-robust "
+ "fulock %p; waiters might block for ever\n",
+ current->pid, current->comm, fulock);
+ __fulock_unlock (fulock, 0, -EOWNERDEAD);
+}
+
+
+/**
+ * When @task exits, release all the fulocks it holds as dead.
+ *
+ * We have to discriminate between ufulocks and fulocks; when it is an
+ * ufulock, we just need to see if we have to put() the reference that
+ * the owner had or not (when the ownership is successfully passed to
+ * somebody else, we don't have to put it).
+ */
+#warning FIXME: hook up to exec()?
+void exit_fulocks (struct task_struct *task)
+{
+ olist_t *itr;
+ struct fulock *fulock;
+ struct fulock_ops *ops;
+ unsigned long flags;
+
+ if (DEBUG > 0 && !olist_empty (&task->fulock_olist))
+ ftrace ("(%p [%d])\n", task, task->pid);
+
+ /* FIXME: there is a better way to do this, but I feel toooo
+ * thick today -- the problem is fulock->ops->exit() is going
+ * to take fulock_olist_lock to reset the ownership...so
+ * whatever it is, we have to call without holding it. */
+ local_irq_save (flags);
+ preempt_disable();
+ _raw_spin_lock (&task->fulock_olist_lock);
+ while (!olist_empty (&task->fulock_olist)) {
+ itr = olist_first (&task->fulock_olist);
+ fulock = container_of (itr, struct fulock, olist_node);
+ ldebug (7, "task %p [%d] still owns fulock %p\n",
+ task, task->pid, fulock);
+ ops = container_of (fulock->fuqueue.ops,
+ struct fulock_ops, fuqueue);
+ if (ops->fuqueue.get)
+ ops->fuqueue.get (&fulock->fuqueue);
+ _raw_spin_unlock (&task->fulock_olist_lock);
+
+ _raw_spin_lock (&fulock->fuqueue.lock);
+ ops->exit (fulock); /* releases the fulock as dead */
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+
+ if (ops->fuqueue.put)
+ ops->fuqueue.put (&fulock->fuqueue);
+ _raw_spin_lock (&task->fulock_olist_lock);
+ }
+ _raw_spin_unlock (&task->fulock_olist_lock);
+ local_irq_restore (flags);
+ preempt_enable();
+}
+
+
+/** Cancel @task's wait on @fuqueue and update the wait list priority */
+unsigned __fulock_op_waiter_cancel (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ unsigned prio_changed;
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+
+ ftrace ("(%p, %p [%d], %p)\n",
+ fuqueue, w, w->task->pid, w);
+
+ prio_changed = __fuqueue_op_waiter_cancel (fuqueue, w);
+ ldebug (2, "prio_change is %u\n", prio_changed);
+ if (prio_changed
+ && fulock->flags & FULOCK_FL_PI
+ && __fulock_prio_update (fulock))
+ __fuqueue_waiter_chprio (fulock->owner);
+ return prio_changed;
+}
+
+
+/** Fulock operations */
+struct fulock_ops fulock_ops = {
+ .fuqueue = {
+ .get = NULL,
+ .put = NULL,
+ .waiter_cancel = __fulock_op_waiter_cancel,
+ .waiter_chprio = __fulock_op_waiter_chprio
+ },
+ .owner_set = __fulock_op_owner_set,
+ .owner_reset = __fulock_op_owner_reset,
+ .exit = __fulock_op_exit
+};
+

2004-04-21 04:17:03

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 11/11: user space fulocks

ufulock.c | 1044 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 1044 insertions(+)

--- /dev/null Thu Apr 15 00:58:26 2004
+++ kernel/ufulock.c Thu Apr 8 00:17:49 2004
@@ -0,0 +1,1044 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ *
+ * The ufulocks are fulocks with a vlocator to help pinpoint them from
+ * a user space address.
+ *
+ * The vfulock is the value in an address in user space that is
+ * associated to the ufulock. The trick here is to keep them in sync
+ * [__vfulock_sync()]. I think it is doing the right thing when we
+ * know about a ufulock in the kernel and something wipes it to some
+ * crappy value in user space (wild pointer)--however, it cannot be
+ * perfect.
+ *
+ * Mapping the address in user space to the ufulock is done through a
+ * hash table kept by the vlocator.h code - ufulock_locate() does the
+ * full work; give it an address, you get a ufulock. No need to worry
+ * about anything else.
+ *
+ * QUICK ROADMAP:
+ *
+ * sys_ufulock_lock(): [try]lock a fulock from user space
+ * sys_ufulock_unlock(): unlock a fulock from user space
+ * sys_ufulock_ctl(): control the state of a ufulock
+ * __ufulock_exit(): called from fulock.c:exit_fulocks() when a
+ * process exits and still holds fulocks.
+ * ufulock_gc(): called from ufu-common.c:ufu_gc() to do garbage
+ * collection.
+ *
+ * TODOs:
+ *
+ * - All the #warning FIXME's need to be sorted out
+ *
+ * - There is no need to change the flags of a mutex once it has been
+ * initialized [POSIX doesn't provide tools for it anyway], except
+ * for the prioceiling. Still need to provide a way for user space
+ * to tell kernel space that we are relinquishing a fulock forever
+ * (for pthread_mutex_destroy()) so we can clean it up and reinit with
+ * different flags; as well, call in from pthread_mutex_init() so we
+ * cache the fulock?).
+ */
+
+#warning FIXME: page unpinning still not that done
+/*
+ * The problem is we need to access the page to modify the value on
+ * when a ufulock dies. We could do it on the exit path, but we don't
+ * have a way to locate the page when it is shared on
+ * __ufulock_exit(). We could have the sys_ufulock_lock() return path
+ * do it, but then we don't update the vfulock to VFULOCK_DEAD when
+ * there are no waiters.
+ *
+ * So, until we find a solution for that, pages are still pinned. It
+ * also means we have to pass the page all the way down to
+ * fulock->ops->owner_set(). A solution would get rid of that
+ * too. It's plain ugly and messy.
+ */
+
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/plist.h>
+#include <asm/uaccess.h> /* copy_from_user() */
+#include <linux/init.h>
+
+#include <linux/futex.h> /* futex keys */
+#include <linux/vlocator.h>
+#include <linux/workqueue.h>
+#include <linux/fulock.h>
+
+#warning DEBUG: remove me
+//#include <asm/kgdb.h>
+
+#if DEBUG > 5
+#define __DEBUG_get_value(result, _vfulock) \
+do { \
+ unsigned val = *_vfulock; \
+ debug ("result %d * _vfulock %u/%x\n", result, val, val); \
+} while (0)
+#define DEBUG_get_value(result, _vfulock) \
+do { \
+ unsigned val; \
+ get_user (val, _vfulock); \
+ debug ("result %d * _vfulock %u/%x\n", result, val, val); \
+} while (0)
+#else
+#define __DEBUG_get_value(a,b) do {} while (0)
+#define DEBUG_get_value(a,b) do {} while (0)
+#endif
+
+extern signed long ufu_timespec2jiffies (const struct timespec __user *);
+
+/** Slab for allocation of 'struct ufulock's. */
+static kmem_cache_t *ufulock_slab;
+
+
+struct fulock_ops ufulock_ops;
+
+
+/** Set an ufulock ownership and increment its use count. */
+static inline
+unsigned __ufulock_op_owner_set (struct fulock *fulock, struct task_struct *task)
+{
+ unsigned prio_changed;
+ struct ufulock *ufulock;
+
+ ftrace ("(%p, %p [%d])\n", fulock, task, task->pid);
+ ufulock = container_of (fulock, struct ufulock, fulock);
+ prio_changed = __fulock_op_owner_set (fulock, task);
+ vl_get (&ufulock->vlocator);
+ return prio_changed;
+}
+
+
+/** Reset an ufulock ownership and drop its use count. */
+static inline
+unsigned __ufulock_op_owner_reset (struct fulock *fulock)
+{
+ unsigned prio_changed;
+ struct ufulock *ufulock;
+
+ ufulock = container_of (fulock, struct ufulock, fulock);
+ ftrace ("(%p [ufulock %p])\n", fulock, ufulock);
+ prio_changed = __fulock_op_owner_reset (fulock);
+ vl_put (&ufulock->vlocator);
+ return prio_changed;
+}
+
+
+/** Initialize an @ufulock (for slab object construction). */
+static
+void ufulock_ctor (void *obj, kmem_cache_t *slab, unsigned long flags)
+{
+ struct ufulock *ufulock = obj;
+ __fulock_init (&ufulock->fulock, &ufulock_ops);
+ vlocator_init (&ufulock->vlocator);
+}
+
+
+/** Allocate an ufulock maintaining debug-allocation counts. */
+static __inline__
+struct vlocator * ufulock_alloc (void)
+{
+ struct ufulock *ufulock;
+ ufulock = kmem_cache_alloc (ufulock_slab, GFP_KERNEL);
+ if (unlikely (ufulock == NULL)) /* don't return &NULL->vlocator */
+ return NULL;
+ if (DEBUG > 0)
+ ldebug (2, "ufulock %p allocated, total: %d\n",
+ ufulock, allocated_inc());
+ return &ufulock->vlocator;
+}
+
+
+/**
+ * Create a ufulock that is going to be inserted to the hash
+ * [called from ufu-common.c:ufu_locate()]. Reference its resources
+ * and initialize what is not initialized by the cache constructor.
+ */
+static
+void ufulock_create (struct vlocator *vl, struct page *page,
+ const union futex_key *key, unsigned flags)
+{
+ struct ufulock *ufulock = container_of (vl, struct ufulock, vlocator);
+ ufulock->fulock.flags = (flags & FULOCK_FL_USER_MK)
+ | FULOCK_FL_NEEDS_SYNC;
+#warning FIXME: if PP, transform flags from POLICY+PRIO to RAWPRIO.
+ memcpy (&vl->key, key, sizeof (*key));
+ get_key_refs (&vl->key);
+ get_page (page);
+ ufulock->page = page;
+}
+
+
+/** Release @ufulock's resources. */
+static __inline__
+void ufulock_release (struct vlocator *vl)
+{
+ struct ufulock *ufulock = container_of (vl, struct ufulock, vlocator);
+ BUG_ON (ufulock->page == NULL);
+ put_page (ufulock->page);
+ drop_key_refs (&vl->key);
+}
+
+
+/** Free an ufulock maintaining debug-allocation counts. */
+static __inline__
+void ufulock_free (struct vlocator *vl)
+{
+ struct ufulock *ufulock = container_of (vl, struct ufulock, vlocator);
+ kmem_cache_free (ufulock_slab, ufulock);
+ if (DEBUG > 0 && ufulock != NULL)
+ ldebug (2, "ufulock %p freed, total: %d\n",
+ ufulock, allocated_dec());
+}
+
+
+struct vlocator_ops ufulock_vops = {
+ .alloc = ufulock_alloc,
+ .create = ufulock_create,
+ .release = ufulock_release,
+ .free = ufulock_free
+};
+
+
+
+/**
+ * Set @vfulock, maintaining consistency with @fulock.
+ *
+ * We just make sure that whichever value we put doesn't override the
+ * VFULOCK_DEAD or VFULOCK_NR indicators.
+ */
+static __inline__
+void __vfulock_update (volatile unsigned *vfulock,
+ const struct fulock *fulock, unsigned value)
+{
+ ftrace ("(%p, %p, %u/%x)\n", vfulock, fulock, value, value);
+
+ if (fulock->flags & FULOCK_FL_DEAD)
+ value = VFULOCK_DEAD;
+ else if (fulock->flags & FULOCK_FL_NR)
+ value = VFULOCK_NR;
+ vfulock_set (vfulock, value);
+ return;
+}
+
+
+/** @returns expected value of the vfulock given the state of the
+ * @fulock. */
+static inline
+unsigned __vfulock_expected_nonkco (struct fulock *fulock)
+{
+ if (fulock->flags & FULOCK_FL_DEAD)
+ return VFULOCK_DEAD;
+ else if (fulock->flags & FULOCK_FL_NR)
+ return VFULOCK_NR;
+ else if (fulock->owner == NULL)
+ return VFULOCK_UNLOCKED;
+ else if (__fulock_empty (fulock)) /* owner != NULL here */
+ return fulock->owner->pid;
+ else
+ return VFULOCK_WP;
+}
+
+
+/**
+ * Perform owner identification chores for __vfulock_sync()
+ *
+ * Probably one of the dirtiest functions I've ever written. The thing
+ * that makes it complex is that after we assign ownership, the
+ * target task might start to exit, and we don't know if it has
+ * gone through exit_fulocks() after we added the fulock to its
+ * ownership list, so we have to be paranoid.
+ *
+ * FIXME: false positives if the PID is reused are the rule. Remedies:
+ *
+ * - The chance of that happening will depend on what the reuse rate is,
+ * how frequently are we assigning new PIDs and the probability of
+ * the sucker dying in the middle of it (so a well-coded program
+ * should be fine, except for the OOM killer, or a crazy process
+ * killing -9 randomly...).
+ *
+ * Let's say the probability is P (one in X)
+ *
+ * - Limit PIDs to 64Ki - 3 (to account for _NR, _DEAD and _WP), create
+ * a 16 bit hash for each task with immutable elements (eg: pointer,
+ * start time...); compose the hash and the 16 bit pid into a 32 bit
+ * thing that the user space uses for fast-locking. When id'ing,
+ * locate the task by PID, hash it out and compare it; if it fails,
+ * the PID was reused, if not, keep on:
+ *
+ * (there are still chances of collision, but I think we have cut
+ * P down by 64Ki)
+ *
+ * - Check the mm mappings. If the task doesn't have the mapping where
+ * the fulock is, well, that was a collision (and that was chance,
+ * probably like shooting up into the sky in the middle of the desert,
+ * the bullet falling back to the ground, hitting this man [who
+ * happened to be wandering around all lost] in the forehead,
+ * killing him, and it happens he is some rotten-rich guy who had
+ * randomly chosen somebody on the phone book to pass on all his
+ * wealth and that one happens to be you).
+ *
+ * - Now, if the task has the same mapping, we assume it is good.
+ *
+ * I think I cannot be more of a pain in the ass, and probably the
+ * mapping thing is overkill; the hash thing looks like more than
+ * enough, and it is pretty cheap (wasn't computing an exact science?
+ * my flaming bullas).
+ *
+ * If still not good enough to you, use KCO mode, sacrificing speed.
+ */
+static
+int __fulock_id_owner (struct fulock *fulock, volatile unsigned *vfulock,
+ unsigned owner_pid, int do_lock)
+{
+ int result = EBUSY;
+ struct task_struct *task;
+
+ ftrace ("(%p, %p, %u, %d)\n", fulock, vfulock, owner_pid, do_lock);
+ CHECK_IRQs();
+
+ BUG_ON (fulock->owner != NULL);
+ _raw_read_lock (&tasklist_lock);
+ task = find_task_by_pid (owner_pid);
+ if (task == NULL || task->flags & PF_EXITING)
+ goto dead_unlock;
+ get_task_struct (task);
+ _raw_read_unlock (&tasklist_lock);
+ __ufulock_op_owner_set (fulock, task);
+ if (unlikely (task->flags & PF_EXITING)) {
+ __ufulock_op_owner_reset (fulock);
+ put_task_struct (task);
+ goto dead;
+ }
+ put_task_struct (task);
+ return result;
+
+dead_unlock:
+ _raw_read_unlock (&tasklist_lock);
+dead:
+ result = -EOWNERDEAD;
+ vfulock_set (vfulock, VFULOCK_DEAD);
+ fulock->flags |= FULOCK_FL_DEAD;
+ if (do_lock)
+ __ufulock_op_owner_set (fulock, current);
+ return result;
+}
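
For reference, a sketch of the 16-bit-hash remedy discussed in the
comment above __fulock_id_owner() (not part of the patch;
task_hash16() is hypothetical and would be derived from immutable task
fields such as the pointer and start time):

        static inline unsigned vfulock_compose_id (struct task_struct *task)
        {
                unsigned hash = task_hash16 (task);    /* hypothetical helper */
                /* PIDs limited to 16 bits; the hash goes in the upper half */
                return (hash << 16) | (task->pid & 0xffff);
        }

When identifying an owner from a vfulock value 'val',
find_task_by_pid (val & 0xffff) would then be cross-checked against
task_hash16() of the task found; a mismatch means the PID was reused
and the owner should be treated as dead.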
+
+
+/**
+ * Sync up a ufulock that supports fast-userspace locking and its associated
+ * vfulock and maybe fast-lock.
+ *
+ * @ufulock: The ufulock to sync up.
+ * @vfulock: pointer to a kmapped associated value.
+ * @do_lock: if non-zero, also try to acquire the lock while syncing.
+ * @returns: < 0 errno code on error [we did not acquire it]
+ * -EOWNERDEAD [previous owner died, we got it]
+ * 0 if we locked it to a PID (and thus 'current' is owner)
+ * > 0 if we didn't lock it, need to proceed to fulock
+ *
+ * Call with uf->fulock.fuqueue.lock held!
+ *
+ * This syncs up the vfulock and the ufulock (if the ufulock is new,
+ * sync from vfulock->ufulock, or the opposite). It tries to leave the
+ * ufulock as it if where a kernel fulock, for the fulock layer to
+ * operate on it.
+ *
+ * When we see VFULOCK_UNLOCKED and move to PID, we don't set the
+ * owner, because there will be nobody coming down to the kernel to
+ * reset it [in effect, it is as if we had done a user space
+ * fast-lock operation].
+ *
+ * The vfulock can only be modified in the following ways:
+ * - In user space:
+ * - when fast-locking: VFULOCK_UNLOCKED to PID
+ * - when fast-unlocking: PID to VFULOCK_UNLOCKED
+ * - when reinitializing from not-recoverable: VFULOCK_NR to VFULOCK_UNLOCKED
+ * - Illegally: any other transition [mistake, error, wild pointer]
+ * - In kernel space: When we have the lock over the ufulock
+ * associated to that vfulock.
+ *
+ * That means that as we are called with the lock held, we can modify
+ * the vfulock at will knowing that nobody is going to touch it as
+ * long as it is not VFULOCK_UNLOCKED (fast-lock), a PID
+ * (fast-unlock)--the not-recoverable case is not of concern because
+ * as soon as anyone sees it, they bail out, so we never get to modify
+ * it.
+ *
+ * Now, for those legal-from-user-space modifications, we operate
+ * atomically; the while(1) loop and vfulock_acas() protect against them.
+ */
+static
+int __vfulock_sync_nonkco (struct ufulock *ufulock, volatile unsigned *vfulock,
+ unsigned do_lock)
+{
+ int result = 0;
+ unsigned val, expected = 0, sync_k2u = 0;
+
+ ftrace ("(%p, %p, %u)\n", ufulock, vfulock, do_lock);
+
+ if ((ufulock->fulock.flags & FULOCK_FL_NEEDS_SYNC) == 0) {
+ sync_k2u = 1; /* synch kernel-to-user */
+ expected = __vfulock_expected_nonkco (&ufulock->fulock);
+ }
+ while (1) {
+ val = *vfulock;
+ // kgdb_ts ((unsigned long) ufulock, *vfulock);
+ if (sync_k2u && val != expected) { /* Unexpected changes? */
+ printk (KERN_WARNING "pid %d (%s) (ufulock %p"
+ "): out of sync, val %u/0x%x, "
+ "expected %u/0x%x. Fixing.\n",
+ current->pid, current->comm,
+ ufulock, val, val, expected, expected);
+ if (expected == VFULOCK_UNLOCKED && val < VFULOCK_WP)
+ goto make_kco; /* Legal: locked? */
+ if (expected < VFULOCK_WP && val == VFULOCK_UNLOCKED)
+ goto fast_lock; /* Legal: unlocked? */
+ if (!vfulock_acas (vfulock, val, expected))
+ continue; /* Illegal: reset it */
+ val = expected;
+ }
+ switch (val) {
+ case VFULOCK_UNLOCKED: /* unlocked, fast-lock it */
+fast_lock:
+ // kgdb_ts ((unsigned long) uf, *vfulock);
+ if (do_lock == 0)
+ goto out;
+ // kgdb_ts ((unsigned long) uf, *vfulock);
+ if (__fulock_empty (&ufulock->fulock)) {
+ if (vfulock_acas (vfulock, val, current->pid))
+ goto out;
+ }
+ else if (vfulock_acas (vfulock, val, VFULOCK_WP)) {
+ result = EBUSY;
+ goto out_synced;
+ }
+ break;
+ case VFULOCK_DEAD: /* idem, but dead owner */
+ ufulock->fulock.flags |= FULOCK_FL_DEAD;
+ case VFULOCK_WP: /* kernel controlled */
+ result = EBUSY;
+ goto out_synced;
+ case VFULOCK_NR: /* gee...gone completely */
+ ufulock->fulock.flags |= FULOCK_FL_NR;
+ result = -ENOTRECOVERABLE;
+ goto out_synced;
+ default: /* This is a PID(locked) no waiters */
+make_kco:
+ if (vfulock_acas (vfulock, val, VFULOCK_WP))
+ goto id_owner;
+ }
+ }
+id_owner: /* Id owner, mark as dead if it died */
+ result = __fulock_id_owner (&ufulock->fulock, vfulock, val, do_lock);
+out_synced:
+ ufulock->fulock.flags &= ~FULOCK_FL_NEEDS_SYNC;
+out: /* We are the sole owners of the lock */
+ __DEBUG_get_value (result, vfulock);
+#warning DEBUG: remove me
+ // if (result == 0)
+ // kgdb_ts ((unsigned long) ufulock, *vfulock);
+ return result;
+}
+
+
+/**
+ * @returns expected value of the vfulock given the state of the
+ * @fulock for KCO mode fulocks.
+ */
+static inline
+unsigned __vfulock_expected_kco (struct fulock *fulock)
+{
+ if (fulock->flags & FULOCK_FL_DEAD)
+ return VFULOCK_DEAD;
+ else if (fulock->flags & FULOCK_FL_NR)
+ return VFULOCK_NR;
+ return VFULOCK_HEALTHY;
+}
+
+
+/**
+ * Synchronize a KCO ufulock with the KCO vfulock in user space.
+ *
+ * KCO ufulocks exist only in kernel space when someone owns them (and
+ * when nobody does, for a while in the cache). When it disappears, we
+ * need to keep the state of the fulock somewhere (if it was healthy,
+ * dead or not-recoverable). We use the vfulock, and VFULOCK_HEALTHY
+ * (for convenience, the same as VFULOCK_UNLOCKED), VFULOCK_DEAD and
+ * VFULOCK_NR.
+ *
+ * The checks are just to be a PITA and catch errors; they will also
+ * fix up stuff (the information in the kernel being the authoritative
+ * reference).
+ */
+static
+int __vfulock_sync_kco (struct ufulock *ufulock, volatile unsigned *vfulock)
+{
+ struct fulock *fulock;
+ int result = EBUSY;
+ unsigned val, new_flags, expected;
+
+ ftrace ("(%p, %p)\n", ufulock, vfulock);
+
+ fulock = &ufulock->fulock;
+ val = *vfulock;
+ new_flags = fulock->flags & ~FULOCK_FL_NEEDS_SYNC;
+ if ((fulock->flags & FULOCK_FL_NEEDS_SYNC) == 0) {
+ expected = __vfulock_expected_kco (fulock);
+ if (val != expected) {
+ printk (KERN_WARNING "pid %d (%s) (ufulock %p"
+ "): out of sync, val %u/0x%x, "
+ "expected %u/0x%x. Fixing.\n",
+ current->pid, current->comm,
+ ufulock, val, val, expected, expected);
+ __vfulock_update (vfulock, fulock, expected);
+ }
+ }
+ switch (val) {
+ case VFULOCK_HEALTHY:
+ break;
+ case VFULOCK_DEAD:
+ new_flags |= FULOCK_FL_DEAD;
+ break;
+ case VFULOCK_NR:
+ result = -ENOTRECOVERABLE;
+ new_flags |= FULOCK_FL_NR;
+ break;
+ /* Why default to not-recoverable? well somebody
+ * screwed up the value of this lock, so we can't be
+ * certain of what its previous state was, so the
+ * safest thing to do is to kill it left and right.
+ */
+ default:
+ printk (KERN_WARNING "pid %d (%s) (KCO ufulock %p): "
+ "invalid value 0x%x. Defaulting to not-recoverable.\n",
+ current->pid, current->comm, ufulock, val);
+ __vfulock_update (vfulock, fulock, VFULOCK_NR);
+ new_flags |= FULOCK_FL_NR;
+ __fulock_unlock (fulock, (size_t)~0, -ENOTRECOVERABLE);
+ result = -ENOTRECOVERABLE;
+ }
+ fulock->flags = new_flags;
+ return result;
+}
+
+
+/**
+ * Sync up a ufulock with the user space vfulock.
+ *
+ * Depending on the type of locking, we go one way or another.
+ */
+static inline
+int __vfulock_sync (struct ufulock *ufulock, volatile unsigned *vfulock,
+ unsigned do_lock)
+{
+ ftrace ("(%p, %p, %u)\n", ufulock, vfulock, do_lock);
+
+ return ufulock->fulock.flags & FULOCK_FL_KCO?
+ __vfulock_sync_kco (ufulock, vfulock) :
+ __vfulock_sync_nonkco (ufulock, vfulock, do_lock);
+}
+
+
+/** Helper for kmapping the user's vfulock to the kernel. */
+static inline
+volatile unsigned * vfulock_kmap (struct ufulock *ufulock)
+{
+ ftrace ("(%p)\n", ufulock);
+ BUG_ON (ufulock->page == NULL);
+ return (volatile unsigned *)
+ (kmap_atomic (ufulock->page, KM_IRQ0)
+ + (ufulock->vlocator.key.both.offset & ~1));
+}
+
+/** Unmap the user's vfulock from the kernel. */
+static inline
+void vfulock_kunmap (volatile unsigned *vfulock)
+{
+ ftrace ("(%p)\n", vfulock);
+ kunmap_atomic ((void *) (PAGE_MASK & (unsigned long)vfulock), KM_IRQ0);
+}
+
+
+/**
+ * Lock a ufulock, maybe wait for it to be available.
+ *
+ * @returns: 0 if acquired, < 0 errno code on error.
+ * -EAGAIN Not acquired, try again [change this, conflicts
+ * with POSIX]
+ * -EBUSY Not acquired, is locked, didn't block
+ * -ENOTRECOVERABLE Not acquired, fulock is not recoverable
+ * -EOWNERDEAD Acquired, previous owner died, data might be
+ * inconsistent
+ * * Not acquired, error.
+ *
+ * USER CONTEXT ONLY
+ *
+ * What we do is the minimal operations to obtain a fulock that cannot
+ * be distinguished from a kernel-only fulock and then call the fulock
+ * layer for doing the locking.
+ *
+ * For obtaining that fulock, we use the address [ufulock_locate()],
+ * lock it and sync up the vfulock and ufulock [__vfulock_sync()],
+ * then ask the fulock layer to lock for us [__fulock_lock()], if we
+ * have to].
+ */
+int ufulock_lock (struct ufulock *ufulock, unsigned flags,
+ signed long timeout)
+{
+ volatile unsigned *vfulock;
+ int result;
+
+ ftrace ("(%p, %x, %ld)\n", ufulock, flags, timeout);
+ might_sleep();
+
+ /* Check flags */
+ spin_lock_irq (&ufulock->fulock.fuqueue.lock);
+ result = -EINVAL;
+ if (flags != (ufulock->fulock.flags & FULOCK_FL_USER_MK))
+ goto out_unlock;
+ /* sync up kernel and user space, maybe fast-lock */
+try_again:
+ vfulock = vfulock_kmap (ufulock);
+ result = __vfulock_sync (ufulock, vfulock, 1);
+ // kgdb_ts ((unsigned long) ufulock, result);
+ vfulock_kunmap (vfulock);
+ if (result <= 0) /* Error or fast-locked */
+ goto out_unlock;
+ result = __fulock_lock (&ufulock->fulock, timeout);
+ if (result == -EAGAIN) {
+ spin_lock_irq (&ufulock->fulock.fuqueue.lock);
+ goto try_again; /* Non-serialized wake up */
+ }
+ // kgdb_ts ((unsigned long) ufulock, result);
+ return result;
+
+out_unlock:
+ spin_unlock_irq (&ufulock->fulock.fuqueue.lock);
+ // kgdb_ts ((unsigned long) ufulock, result);
+ return result;
+}
+
+
+/**
+ * Lock a ufulock, maybe wait for it to be available.
+ *
+ * @returns: 0 if acquired, < 0 errno code on error.
+ * -EAGAIN Not acquired, try again [change this, conflicts
+ * with POSIX]
+ * -EBUSY Not acquired, is locked, didn't block
+ * -ENOTRECOVERABLE Not acquired, fulock is not recoverable
+ * -EOWNERDEAD Acquired, previous owner died, data might be
+ * inconsistent
+ * * Not acquired, error.
+ *
+ * USER CONTEXT ONLY
+ *
+ * Massage everything, locate an ufulock and lock it with ufulock_lock().
+ */
+asmlinkage
+int sys_ufulock_lock (volatile unsigned __user *_vfulock, unsigned flags,
+ struct timespec __user *_timeout)
+{
+ struct ufulock *ufulock;
+ struct vlocator *vl;
+ signed long timeout;
+ int result = -EINVAL;
+
+ ftrace ("(%p, 0x%x, %p)\n", _vfulock, flags, _timeout);
+
+ /* Pin page, get key, look up/allocate the ufulock, refcount it */
+ result = vl_locate (NULL, &vl, &ufulock_vops, _vfulock, flags);
+ if (unlikely (result < 0))
+ goto out;
+ ufulock = container_of (vl, struct ufulock, vlocator);
+ timeout = ufu_timespec2jiffies (_timeout);
+ result = (int) timeout;
+ if (unlikely (timeout < 0))
+ goto out_put;
+ // kgdb_ts ((unsigned long) _vfulock, (unsigned long) ufulock);
+ result = ufulock_lock (ufulock, flags, timeout);
+out_put:
+ vl_put (vl);
+out:
+ DEBUG_get_value (result, _vfulock);
+ return result;
+}
+
+
+/**
+ * Sync up an unfulock, then unlock it.
+ *
+ * Note the gimmick: when it is an unserialized unlock (howmany > 0),
+ * we need to declare the fulock FULOCK_FL_NEEDS_SYNC so the next time
+ * somebody gets into __vfulock_sync_nonkco(), it will properly sync
+ * from userspace.
+ *
+ * @ufulock: ufulock structure.
+ * @flags: flags from user space for the ufulock
+ * @howmany: Wake up this many waiters; if 0, wake up only one,
+ * forcing a serialization in the acquisition of the
+ * lock, so that no other task (in user or kernel space)
+ * can acquire it in the meantime.
+ * @returns: 0 if ok, < 0 errno code on error.
+ */
+int __ufulock_unlock (struct ufulock *ufulock, unsigned flags,
+ size_t howmany)
+{
+ volatile unsigned *vfulock;
+ int result = -EINVAL;
+ struct fulock *fulock = &ufulock->fulock;
+
+ ftrace ("(%p, %x, %zu)\n", ufulock, flags, howmany);
+ CHECK_IRQs();
+
+ if (flags != (fulock->flags & FULOCK_FL_USER_MK))
+ goto out;
+ vfulock = vfulock_kmap (ufulock);
+ result = __vfulock_sync (ufulock, vfulock, 0);
+ if (result < 0)
+ goto out_kunmap;
+ /* Now do a proper unlock and update the vfulock */
+ if (howmany > 0)
+ __vfulock_update (vfulock, fulock, VFULOCK_UNLOCKED);
+ result = __fulock_unlock (fulock, howmany, 0);
+ if (result < 0)
+ goto out_kunmap;
+ else if (howmany == 0) {
+ unsigned new_val;
+ if (fulock->flags & FULOCK_FL_KCO)
+ new_val = VFULOCK_HEALTHY;
+ else if (result == 0)
+ new_val = VFULOCK_UNLOCKED;
+ else if (result == 1) { /* enable fast-unlock */
+ new_val = fulock->owner->pid;
+ fulock->flags |= FULOCK_FL_NEEDS_SYNC;
+ __ufulock_op_owner_reset (fulock);
+ }
+ else
+ new_val = VFULOCK_WP;
+ __vfulock_update (vfulock, &ufulock->fulock, new_val);
+ }
+ else
+ fulock->flags |= FULOCK_FL_NEEDS_SYNC;
+ result = 0;
+out_kunmap:
+ vfulock_kunmap (vfulock);
+out:
+ return result;
+}
+
+
+/**
+ * Unlock an ufulock, waking up waiter(s)
+ *
+ * @_vfulock: Address for the fulock (user space).
+ * @howmany: Number of waiters to wake up [see ufulock_unlock()]
+ * @returns: 0 if ok, < 0 errno code on error.
+ *
+ * USER CONTEXT ONLY
+ *
+ * There is a gloomy thing here. We should be just looking for a
+ * ufulock associated to _vfulock and if not found, assume there are
+ * no waiters and just set *_vfulock to unlocked.
+ *
+ * But by doing that, we can provoke a race condition: another thread
+ * B just came in, saw it locked (vfulock is the locker's PID), went
+ * into the kernel, saw it unlocked there and acquired it, setting the
+ * vfulock to VFULOCK_WP; at that moment the original thread A sets
+ * the vfulock to unlocked -- havoc happens.
+ *
+ * So we do exactly the same thing as lock() does: we locate();
+ * this way, we also force the caching of the fulock to be
+ * refreshed. Then, the setting of the *_vfulock [from inside the
+ * kernel] is protected by the spinlock.
+ */
+asmlinkage
+int sys_ufulock_unlock (volatile unsigned __user *_vfulock, unsigned flags,
+ size_t howmany)
+{
+ struct vlocator *vl;
+ struct ufulock *ufulock;
+ int result = -ENOMEM;
+ unsigned long cpu_flags;
+
+ ftrace ("(%p, 0x%x, %zu)\n", _vfulock, flags, howmany);
+
+ /* Pin page, get key, look up/allocate the ufulock, refcount it */
+ result = vl_locate (NULL, &vl, &ufulock_vops, _vfulock, flags);
+ if (unlikely (result < 0))
+ goto out;
+ ufulock = container_of (vl, struct ufulock, vlocator);
+ spin_lock_irqsave (&ufulock->fulock.fuqueue.lock, cpu_flags);
+ result = __ufulock_unlock (ufulock, flags, howmany);
+ spin_unlock_irqrestore (&ufulock->fulock.fuqueue.lock, cpu_flags);
+ vl_put (vl);
+out:
+ DEBUG_get_value (result, _vfulock);
+ return result;
+}
+
+
+/**
+ * (re)Initialize an ufulock.
+ *
+ * @ufulock: who to reinitialize.
+ * @flags: flags to reinitialize it with.
+ */
+static
+int __ufulock_init (struct ufulock *ufulock, unsigned flags)
+{
+ volatile unsigned *vfulock;
+
+ ftrace ("(%p, 0x%x)\n", ufulock, flags);
+ /* Initialization will not fail... */
+ __fulock_ctl (&ufulock->fulock, fulock_ctl_init);
+ ufulock->fulock.flags = (flags & FULOCK_FL_USER_MK)
+ | FULOCK_FL_NEEDS_SYNC;
+ vfulock = vfulock_kmap (ufulock);
+ vfulock_set (vfulock, VFULOCK_UNLOCKED);
+ vfulock_kunmap (vfulock);
+ return 0;
+}
+
+
+/**
+ * Run a control command on a fulock.
+ *
+ * @_vfulock: Location of the fulock in user space
+ * @flags: flags for the fulock
+ * @ctl: Command to run (enum fulock_ctl).
+ * @returns: >= 0 previous consistency before the change (enum fulock_st).
+ * < 0 errno code on error.
+ *
+ * USER CONTEXT ONLY
+ */
+asmlinkage
+int sys_ufulock_ctl (unsigned __user *_vfulock, unsigned flags, unsigned ctl)
+{
+ struct vlocator *vl;
+ struct fulock *fulock;
+ struct ufulock *ufulock;
+ int result;
+ unsigned new_value;
+ volatile unsigned *vfulock;
+
+ ftrace ("(%p, 0x%x, %u)\n", _vfulock, flags, ctl);
+ might_sleep();
+
+ /* Ugly special case numero uno: destruction; can't really
+ * destroy it (somebody might be using it still), but we can
+ * disconnect it from the hash until the gc destroys it. */
+ if (ctl == fulock_ctl_destroy) {
+ vl_dispose (&ufulock_vops, _vfulock);
+ return 0;
+ }
+ /* Pin page, get key, look up/allocate the ufulock, refcount it */
+ result = vl_locate (NULL, &vl, &ufulock_vops, _vfulock, flags);
+ if (unlikely (result < 0))
+ goto out;
+ ufulock = container_of (vl, struct ufulock, vlocator);
+ fulock = &ufulock->fulock;
+ spin_lock_irq (&fulock->fuqueue.lock);
+ /* Ugly special case numero two: Initialization */
+ if (ctl == fulock_ctl_init) {
+ result = __ufulock_init (ufulock, flags);
+ goto out_unlock;
+ }
+ /* make sure flags are consistent */
+ result = -EINVAL;
+ if (flags != (fulock->flags & FULOCK_FL_USER_MK))
+ goto out_unlock;
+ /* Ugly special case number three: do we have waiters? */
+ if (ctl == fulock_ctl_waiters) {
+ result = __fulock_empty (fulock)? 0 : 1;
+ goto out_unlock;
+ }
+ /* Ugly special case number four: is it locked? */
+ if (ctl == fulock_ctl_locked) {
+ result = fulock->owner == NULL? 0 : 1;
+ goto out_unlock;
+ }
+ /* Map the user space area */
+ vfulock = vfulock_kmap (ufulock);
+ result = __vfulock_sync (ufulock, vfulock, 0);
+ if (result < 0)
+ goto out_kunmap;
+ /* Set the consistency */
+ result = __fulock_ctl (fulock, ctl);
+ if (result < 0)
+ goto out_kunmap;
+ /* Ok, update the vfulock */
+ switch (result) {
+ case fulock_st_healthy:
+ if (unlikely (fulock->flags & FULOCK_FL_KCO))
+ new_value = VFULOCK_HEALTHY;
+ else if (__fulock_empty (fulock)) {
+ new_value = current->pid;
+ fulock->flags |= FULOCK_FL_NEEDS_SYNC;
+ __ufulock_op_owner_reset (fulock);
+ }
+ else
+ new_value = VFULOCK_WP;
+ break;
+ case fulock_st_dead:
+ new_value = VFULOCK_DEAD;
+ break;
+ case fulock_st_nr:
+ new_value = VFULOCK_NR;
+ break;
+ default:
+ WARN_ON(1);
+ new_value = 0; /* shut gcc up */
+ }
+ vfulock_set (vfulock, new_value);
+ __DEBUG_get_value (result, vfulock);
+ result = 0;
+out_kunmap:
+ vfulock_kunmap (vfulock);
+out_unlock:
+ spin_unlock_irq (&fulock->fuqueue.lock);
+ vl_put (vl);
+out:
+ return result;
+}
+
+
+/** Release as dead @ufulock because the owner is exiting. */
+void __ufulock_op_exit (struct fulock *fulock)
+{
+ struct ufulock *ufulock;
+ volatile unsigned *vfulock;
+
+ ftrace ("(%p)\n", fulock);
+
+ ufulock = container_of (fulock, struct ufulock, fulock);
+ vfulock = vfulock_kmap (ufulock);
+ /* Update the vfulock to VFULOCK_DEAD */
+ __vfulock_update (vfulock, fulock, VFULOCK_DEAD);
+ vfulock_kunmap (vfulock);
+ __fulock_op_exit (fulock);
+}
+
+
+/** Cancel @task's wait on the ufulock */
+unsigned __ufulock_op_waiter_cancel (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ unsigned prio_changed;
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+ struct ufulock *ufulock =
+ container_of (fulock, struct ufulock, fulock);
+
+ ftrace ("(%p, %p [%d], %p)\n", fuqueue, w, w->task->pid, w);
+
+ prio_changed = __fulock_op_waiter_cancel (fuqueue, w);
+ if ((fulock->flags & FULOCK_FL_KCO) == 0
+ && __fulock_empty (fulock)) {
+ /* Re-enable fast unlock */
+ volatile unsigned *vfulock;
+ unsigned value = VFULOCK_UNLOCKED;
+ if (fulock->owner) {
+ value = fulock->owner->pid;
+ ufulock->fulock.flags |= FULOCK_FL_NEEDS_SYNC;
+ __ufulock_op_owner_reset (fulock);
+ }
+ vfulock = vfulock_kmap (ufulock);
+ __vfulock_update (vfulock, fulock, value);
+ vfulock_kunmap (vfulock);
+ }
+ return prio_changed;
+}
+
+
+/* Adaptors for fulock operations */
+static
+void ufulock_put (struct fuqueue *fuqueue)
+{
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+ struct ufulock *ufulock =
+ container_of (fulock, struct ufulock, fulock);
+ vl_put (&ufulock->vlocator);
+}
+
+static
+void ufulock_get (struct fuqueue *fuqueue)
+{
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+ struct ufulock *ufulock =
+ container_of (fulock, struct ufulock, fulock);
+ vl_get (&ufulock->vlocator);
+}
+
+/** ufulock fulock operations */
+struct fulock_ops ufulock_ops = {
+ .fuqueue = {
+ .get = ufulock_get,
+ .put = ufulock_put,
+ .waiter_cancel = __ufulock_op_waiter_cancel,
+ .waiter_chprio = __fulock_op_waiter_chprio
+ },
+ .owner_set = __ufulock_op_owner_set,
+ .owner_reset = __ufulock_op_owner_reset,
+ .exit = __ufulock_op_exit,
+};
+
+
+/**
+ * Initialize the ufulock subsystem.
+ */
+static
+int __init subsys_ufulock_init (void)
+{
+ ufulock_slab = kmem_cache_create ("ufulock", sizeof (struct ufulock),
+ 0, 0, ufulock_ctor, NULL);
+ if (ufulock_slab == NULL)
+ panic ("ufulock_init(): "
+ "Unable to initialize ufulock slab allocator.\n");
+ return 0;
+}
+__initcall (subsys_ufulock_init);
+
+
+/**
+ * Convert a user 'struct timespec *' to jiffies with a twist
+ *
+ * @_timespec: pointer to user space 'struct timespec'.
+ *
+ * @returns: MAX_SCHEDULE_TIMEOUT
+ * Wait for ever, if @_timespec is (void *) -1.
+ * 0 Don't wait at all (if @_timespec is NULL).
+ * Other: Number of jiffies to wait as specified in
+ * @_timespec.
+ *
+ * I kind of think this is still a little bit twisted -- raise your
+ * hand if you can think of a better arrangement.
+ *
+ * WARNING: might sleep!
+ */
+signed long ufu_timespec2jiffies (const struct timespec __user *_timespec)
+{
+ struct timespec timespec;
+ signed long j;
+
+ might_sleep();
+ if (_timespec == NULL)
+ j = 0;
+ else if (_timespec == (struct timespec *) ~0)
+ j = MAX_SCHEDULE_TIMEOUT;
+ else {
+ if (copy_from_user (&timespec, _timespec, sizeof (timespec)))
+ return -EFAULT;
+ j = timespec_to_jiffies (&timespec);
+ }
+ return j;
+}
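
For reference, a user space view of the non-KCO fast path these
syscalls back, as described in __vfulock_sync_nonkco() (a hedged
sketch, not part of the patch: the VFULOCK_UNLOCKED constant, the
cas() compare-and-swap helper, self_tid, flags and the
sys_ufulock_*() wrappers are all assumptions here, provided by the
user space library and the earlier patches in the series):

        volatile unsigned vfulock = VFULOCK_UNLOCKED;    /* the shared word */

        void my_lock (void)
        {
                /* fast-lock: UNLOCKED -> our thread id, no syscall */
                if (cas (&vfulock, VFULOCK_UNLOCKED, self_tid))
                        return;
                /* contended or odd value: let the kernel sort it out;
                 * (struct timespec *) -1 means wait for ever */
                sys_ufulock_lock (&vfulock, flags, (struct timespec *) -1);
        }

        void my_unlock (void)
        {
                /* fast-unlock: our thread id -> UNLOCKED; the kernel sets
                 * VFULOCK_WP when there are waiters, so this fails then */
                if (cas (&vfulock, self_tid, VFULOCK_UNLOCKED))
                        return;
                sys_ufulock_unlock (&vfulock, flags, 0);    /* serialized */
        }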