2008-08-01 21:23:34

by Gregory Haskins

Subject: [PATCH RT RFC 0/7] Priority Inheritance enhancements

** RFC for PREEMPT_RT branch, 26-rt1 **

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a
new approach to priority-inheritance (PI), as well as some performance
enhancements that take advantage of that new approach. This yields at
least a 10-15% improvement for disk I/O on my 4-way x86_64 system. An
8-way system saw as much as a 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance kernel
performance with respect to PREEMPT_RT:

1) For instance, it would be nice to support priority queuing and
(at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
of their non-rt equivalents. We have determined via instrumentation
that one area of major overhead is the pi-boost logic.

Today, however, the PI code is entwined in the rtmutex infrastructure,
and we need more flexibility if we want to address (1) and (2)
above. Therefore the first step is to separate the PI code from
rtmutex into its own library (libpi). This is covered in patches 1-6.

(I realize patch #6 is a little hard to review, since I removed and added
a lot of code which the unified diff mashes together... I will try
to find a way to make this more readable.)

Patch 7 is the first real consumer of the libpi logic, aimed at enhancing
performance. It accomplishes this by deferring the pi-boost of a lock
owner until it is absolutely necessary. Since instrumentation
shows that the majority of locks are acquired either via the fast-path
or via the adaptive-spin path, we can eliminate most of the pi overhead
with this technique. This yields a measurable performance gain (at least
a 10% improvement was observed in our lab for workloads with heavy
lock contention).
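
To give a rough feel for the idea, here is a purely conceptual sketch (this
is not the patch-7 code itself; every *_sketch() helper below is hypothetical):

    /*
     * Conceptual sketch only: pay the pi-boost/deboost cost solely on the
     * path that actually has to sleep.  Fast-path and adaptive-spin
     * acquisitions never touch the pi logic at all.
     */
    static void rt_lock_acquire_sketch(struct rt_mutex *lock)
    {
            for (;;) {
                    if (rt_lock_trylock_sketch(lock))         /* hypothetical */
                            return;                  /* no boost needed */

                    if (lock_owner_is_running_sketch(lock)) { /* hypothetical */
                            cpu_relax();     /* adaptive spin: still no boost */
                            continue;
                    }

                    /* We really have to sleep: boost the owner only now */
                    boost_lock_owner_sketch(lock);            /* hypothetical */
                    sleep_on_lock_sketch(lock);               /* hypothetical */
                    deboost_lock_owner_sketch(lock);          /* hypothetical */
            }
    }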

For your convenience, you can also find these patches in a git tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git pi-rework

We have not yet completed the work on the pi-waitqueues or any of the other
related pi enhancements. Those will be coming in a follow-on announcement.

Feedback/comments welcome!

Regards,
-Greg


2008-08-01 21:23:54

by Gregory Haskins

Subject: [PATCH RT RFC 1/7] add generalized priority-inheritance interface

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with a sole focus on PI and the
relationships between the objects that can boost priority and the
objects that get boosted.

We introduce the concepts of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.
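
As a quick illustration of the intended usage, here is a minimal sketch
(the "my_object" structure and its callbacks are purely hypothetical and
not part of this patch; the pattern mirrors what a later patch does for
task_struct):

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/pi.h>

    struct my_object {                      /* hypothetical consumer */
            struct pi_node node;            /* our link in a pi-chain */
            struct pi_sink snk;             /* leaf-sink: observes boosts */
            int prio;                       /* last priority we were told */
    };

    /* leaf-sink callback: simply remember the new priority */
    static int my_object_boost_cb(struct pi_sink *snk, struct pi_source *src,
                                  unsigned int flags)
    {
            struct my_object *obj = container_of(snk, struct my_object, snk);

            obj->prio = *src->prio;
            return 0;
    }

    static struct pi_sink my_object_sink = {
            .boost = my_object_boost_cb,
    };

    static void my_object_init(struct my_object *obj)
    {
            pi_node_init(&obj->node);
            obj->prio = MAX_PRIO;
            obj->snk = my_object_sink;

            /* get notified whenever the node's aggregate priority changes */
            pi_add_sink(&obj->node, &obj->snk,
                        PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
    }

    /*
     * A boosting entity owns a pi_source referencing its own priority
     * value.  It registers the source once, and simply calls pi_boost()
     * again whenever the referenced value changes.
     */
    static void my_object_add_boost(struct my_object *obj,
                                    struct pi_source *src, int *prio)
    {
            pi_source_init(src, prio);      /* once per source */
            pi_boost(&obj->node, src, 0);   /* 0 == synchronous chain update */
    }

    static void my_object_refresh_boost(struct my_object *obj,
                                        struct pi_source *src)
    {
            pi_boost(&obj->node, src, 0);   /* re-boost after *prio changed */
    }

pi_deboost() and pi_del_sink() undo the source and sink registrations,
respectively.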

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 +++++++
include/linux/pi.h | 226 +++++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 685 insertions(+), 1 deletions(-)

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------------
+
+This library introduces the basic concepts of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boost and for any subsequent refreshes of the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for updating a chain of sources and sinks in the proper order. If we passed the priority by value on the stack, the order in which the system executes could allow the actual value that gets set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a "link in the chain", such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly-destructible (meaning they do not require an explicit "free()" operation when they are no longer used). This is important considering their intended use (spinlock_t's, which are also implicitly-destructible). As such, any allocations needed for operation must come from internal structure storage, as there will be no opportunity to free them later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason we added the ability to add more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, informally called "leaf sinks".
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+ --------- ---------
+ | leaf | | leaf |
+ --------- ---------
+ / /
+ --------- / ---------- / --------- ---------
+ ->-| node |->---| node |-->---| node |->---| leaf |
+ --------- ---------- --------- ---------
+
+This was done to unify the notion of a "sink" behind a single interface, rather than having something like task->pi_blocked_on plus a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. a task, a lock, a node, a work-item, a wait-queue, etc).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-source relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that below, but the long and short of it is that we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..80b8d96
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,226 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+
+struct pi_sink {
+ int (*boost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *snk,
+ unsigned int flags);
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *snk;
+ struct pi_source src;
+ atomic_t refs;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+ int count;
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink snk;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head snks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @snk: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an initial boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @snk: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->boost)
+ return snk->boost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to remove from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->deboost)
+ return snk->deboost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->update)
+ return snk->update(snk, flags);
+
+ return 0;
+}
+
+#endif /* _LINUX_PI_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..a1646db
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,398 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+ pool->count = 0;
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ BUG_ON(!pool->count);
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+ pool->count--;
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_addref(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_dropref(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&node->lock, flags);
+
+ if (atomic_dec_and_test(&sinkref->refs)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ }
+
+ spin_unlock_irqrestore(&node->lock, flags);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *snk)
+{
+ return container_of(snk, struct pi_node, snk);
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+}
+
+static int
+_pi_node_update(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+ int prio;
+ int count = 0;
+ int i;
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!plist_head_empty(&node->srcs))
+ prio = plist_first(&node->srcs)->prio;
+ else
+ prio = MAX_PRIO;
+
+ list_for_each_entry(sinkref, &node->snks, list) {
+ /*
+ * If the priority is changing, or if this is a BOOST/DEBOOST
+ */
+ if (node->prio != prio
+ || sinkref->state != pi_state_boosted) {
+
+ BUG_ON(!atomic_read(&sinkref->refs));
+ _pi_sink_addref(sinkref);
+
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ node->prio = prio;
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ /*
+ * Perform the actual operation on each sink
+ */
+ for (i = 0; i < count; ++i) {
+ struct pi_sink *snk;
+ unsigned int lflags = 0;
+ int update = 0;
+
+ sinkref = sinkrefs[i];
+ snk = sinkref->snk;
+
+ spin_lock_irqsave(&sinkref->lock, iflags);
+
+ if (snk->update) {
+ lflags |= PI_FLAG_DEFER_UPDATE;
+ update = 1;
+ }
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ snk->boost(snk, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ snk->deboost(snk, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above for the duration of the update
+ */
+ atomic_dec(&sinkref->refs);
+ break;
+ case pi_state_free:
+ update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock_irqrestore(&sinkref->lock, iflags);
+
+ if (update)
+ snk->update(snk, 0);
+
+ _pi_sink_dropref(node, sinkref);
+ }
+
+ return 0;
+}
+
+static int
+_pi_node_boost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static struct pi_sink pi_node_snk = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+};
+
+void pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ node->snk = pi_node_snk;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->snks);
+ plist_head_init(&node->srcs, &node->lock);
+}
+
+int pi_add_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ pi_source_init(&sinkref->src, &node->prio);
+ sinkref->snk = snk;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->snks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int pi_del_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * There may be multiple matches to snk because sometimes a
+ * deboost/free may still be pending an update when the same
+ * sink has been added again. So we want to process all instances
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ if (sinkref->snk == snk) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i) {
+ int remove = 0;
+
+ sinkref = sinkrefs[i];
+
+ spin_lock_irqsave(&sinkref->lock, iflags);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (sinkref->snk->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above for the duration of the update
+ */
+ atomic_dec(&sinkref->refs);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock_irqrestore(&sinkref->lock, iflags);
+
+ _pi_sink_dropref(node, sinkref);
+ }
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+
+

2008-08-01 21:24:25

by Gregory Haskins

Subject: [PATCH RT RFC 2/7] sched: add the basic PI infrastructure to the task_struct

This is a first pass at converting the system to use the new PI library.
We don't go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place. Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.
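
To make the pattern concrete, here is a minimal sketch of how a subsystem
boosts a task with the new helpers (the "my_booster" names are
hypothetical; the rcu_prio/rcu_prio_src and rtmutex_prio/rtmutex_prio_src
conversions in the diff below follow this same shape):

    #include <linux/sched.h>
    #include <linux/pi.h>

    struct my_booster {                     /* hypothetical */
            int prio;                       /* value we want the task to see */
            struct pi_source src;           /* reference handed to the task */
    };

    static void my_booster_init(struct my_booster *b)
    {
            b->prio = MAX_PRIO;             /* "not boosting" */
            pi_source_init(&b->src, &b->prio);
    }

    /* initial boost, or refresh after the value changes (up or down) */
    static void my_boost(struct my_booster *b, struct task_struct *p, int prio)
    {
            b->prio = prio;
            task_pi_boost(p, &b->src, 0);   /* 0 == synchronous update */
    }

    /* stop contributing to p's priority altogether */
    static void my_unboost(struct my_booster *b, struct task_struct *p)
    {
            task_pi_deboost(p, &b->src, 0);
    }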

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 37 +++++++--
include/linux/workqueue.h | 2
kernel/fork.c | 1
kernel/rcupreempt-boost.c | 18 +---
kernel/rtmutex.c | 6 +
kernel/sched.c | 188 ++++++++++++++++++++++++++++++++-------------
kernel/workqueue.c | 39 ++++++++-
7 files changed, 207 insertions(+), 84 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..63ddd1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
#include <linux/task_io_accounting.h>
#include <linux/kobject.h>
#include <linux/latencytop.h>
+#include <linux/pi.h>

#include <asm/processor.h>

@@ -1125,6 +1126,7 @@ struct task_struct {
int prio, static_prio, normal_prio;
#ifdef CONFIG_PREEMPT_RCU_BOOST
int rcu_prio;
+ struct pi_source rcu_prio_src;
#endif
const struct sched_class *sched_class;
struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct {
+ struct pi_source src; /* represents normal_prio to 'this' */
+ struct pi_node node;
+ struct pi_sink snk; /* registered to 'this' to get updates */
+ int prio;
+ } pi;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ int rtmutex_prio;
+ struct pi_source rtmutex_prio_src;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
#endif
};

+static inline int
+task_pi_boost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_boost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_deboost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_deboost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_update(struct task_struct *p, unsigned int flags)
+{
+ return pi_update(&p->pi.node, flags);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0)
#else
@@ -1774,14 +1805,8 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-extern void task_setprio(struct task_struct *p, int prio);
-
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
-static inline void rt_mutex_setprio(struct task_struct *p, int prio)
-{
- task_setprio(p, prio);
-}
extern void rt_mutex_adjust_pi(struct task_struct *p);
#else
static inline int rt_mutex_getprio(struct task_struct *p)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 229179e..3dc4ed9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/plist.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -31,6 +32,7 @@ struct work_struct {
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct plist_node entry;
work_func_t func;
+ struct pi_source pi_src;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b49488d..399a0d0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -990,6 +990,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->rcu_flipctr_idx = 0;
#ifdef CONFIG_PREEMPT_RCU_BOOST
p->rcu_prio = MAX_PRIO;
+ pi_source_init(&p->rcu_prio_src, &p->rcu_prio);
p->rcub_rbdp = NULL;
p->rcub_state = RCU_BOOST_IDLE;
INIT_LIST_HEAD(&p->rcub_entry);
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 5282b19..9373b9e 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -236,10 +236,8 @@ static void rcu_boost_task(struct task_struct *task)

rcu_trace_boost_task_boost_called(RCU_BOOST_ME);

- if (task->rcu_prio < task->prio) {
+ if (task_pi_boost(task, &task->rcu_prio_src, 0))
rcu_trace_boost_task_boosted(RCU_BOOST_ME);
- task_setprio(task, task->rcu_prio);
- }
}

/**
@@ -281,14 +279,8 @@ void __rcu_preempt_boost(void)

rcu_trace_boost_try_boost(rbd);

- prio = rt_mutex_getprio(curr);
-
if (list_empty(&curr->rcub_entry))
list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost);
- if (prio <= rbd->rbs_prio)
- goto out;
-
- rcu_trace_boost_boosted(curr->rcub_rbdp);

set_rcu_prio(curr, rbd->rbs_prio);
rcu_boost_task(curr);
@@ -353,11 +345,11 @@ void __rcu_preempt_unboost(void)

rcu_trace_boost_unboosted(rbd);

+ task_pi_deboost(curr, &curr->rcu_prio_src, 0);
+
set_rcu_prio(curr, MAX_PRIO);

spin_lock(&curr->pi_lock);
- prio = rt_mutex_getprio(curr);
- task_setprio(curr, prio);

curr->rcub_rbdp = NULL;

@@ -393,9 +385,7 @@ static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned lon
list_move_tail(&p->rcub_entry,
&rbd->rbs_boosted);
set_rcu_prio(p, prio);
- spin_lock(&p->pi_lock);
- rcu_boost_task(p);
- spin_unlock(&p->pi_lock);
+ task_pi_boost(p, &p->rcu_prio_src, 0);

/*
* Now we release the lock to allow for a higher
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 377949a..7d11380 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -178,8 +178,10 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
- rt_mutex_setprio(task, prio);
+ if (task->rtmutex_prio != prio) {
+ task->rtmutex_prio = prio;
+ task_pi_boost(task, &task->rtmutex_prio_src, 0);
+ }
}

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 54ea580..c129b10 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1709,26 +1709,6 @@ static inline int normal_prio(struct task_struct *p)
}

/*
- * Calculate the current priority, i.e. the priority
- * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
- */
-static int effective_prio(struct task_struct *p)
-{
- p->normal_prio = normal_prio(p);
- /*
- * If we are RT tasks or we were boosted to RT priority,
- * keep the priority unchanged. Otherwise, update priority
- * to the normal priority:
- */
- if (!rt_prio(p->prio))
- return p->normal_prio;
- return p->prio;
-}
-
-/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -2375,6 +2355,58 @@ static void __sched_fork(struct task_struct *p)
p->state = TASK_RUNNING;
}

+static int
+task_pi_boost_cb(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+
+ /*
+ * We don't need any locking here, since the .boost operation
+ * is already guaranteed to be mutually exclusive
+ */
+ p->pi.prio = *src->prio;
+
+ return 0;
+}
+
+static int task_pi_update_cb(struct pi_sink *snk, unsigned int flags);
+
+static struct pi_sink task_pi_sink = {
+ .boost = task_pi_boost_cb,
+ .update = task_pi_update_cb,
+};
+
+static inline void
+task_pi_init(struct task_struct *p)
+{
+ pi_node_init(&p->pi.node);
+
+ /*
+ * Feed our initial state of normal_prio into the PI infrastructure.
+ * We will update this whenever it changes
+ */
+ p->pi.prio = p->normal_prio;
+ pi_source_init(&p->pi.src, &p->normal_prio);
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+
+#ifdef CONFIG_RT_MUTEXES
+ p->rtmutex_prio = MAX_PRIO;
+ pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
+ task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
+#endif
+
+ /*
+ * We add our own task as a dependency of ourselves so that
+ * we get boost-notifications (via task_pi_boost_cb) whenever
+ * our priority is changed (locally e.g. setscheduler() or
+ * remotely via a pi-boost).
+ */
+ p->pi.snk = task_pi_sink;
+ pi_add_sink(&p->pi.node, &p->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -2396,6 +2428,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

+ task_pi_init(p);
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -2411,6 +2445,55 @@ void sched_fork(struct task_struct *p, int clone_flags)
}

/*
+ * In the past, task_setprio was exposed as an API. This variant is only
+ * meant to be called from pi_update functions (namely, task_updateprio() and
+ * task_pi_update_cb()). If you need to adjust the priority of a task,
+ * you should be using something like setscheduler() (permanent adjustments)
+ * or task_pi_boost() (temporary adjustments).
+ */
+static void
+task_setprio(struct task_struct *p, int prio)
+{
+ if (prio == p->prio)
+ return;
+
+ if (rt_prio(prio))
+ p->sched_class = &rt_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+
+ p->prio = prio;
+}
+
+static inline void
+task_updateprio(struct task_struct *p)
+{
+ int prio = normal_prio(p);
+
+ if (p->normal_prio != prio) {
+ p->normal_prio = prio;
+ set_load_weight(p);
+
+ /*
+ * Reboost our normal_prio entry, which will
+ * also chain-update any of our PI dependencies (of course)
+ * on our next update
+ */
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+ }
+
+ /*
+ * If normal_prio is logically higher than our current setting,
+ * just assign the priority/class immediately so that any callers
+ * will see the update as synchronous without dropping the rq-lock
+ * to do a pi_update. Any discrepancy with pending pi-updates will
+ * automatically be corrected after we drop the rq-lock.
+ */
+ if (p->normal_prio < p->prio)
+ task_setprio(p, p->normal_prio);
+}
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -2426,7 +2509,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);

- p->prio = effective_prio(p);
+ task_updateprio(p);

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2447,6 +2530,8 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_wake_up(rq, p);
#endif
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -4887,27 +4972,25 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
EXPORT_SYMBOL(sleep_on_timeout);

/*
- * task_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * Invoked whenever our priority changes by the PI library
*
* This function changes the 'effective' priority of a task. It does
* not touch ->normal_prio like __setscheduler().
*
- * Used by the rt_mutex code to implement priority inheritance logic
- * and by rcupreempt-boost to boost priorities of tasks sleeping
- * with rcu locks.
*/
-void task_setprio(struct task_struct *p, int prio)
+static int
+task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
{
- unsigned long flags;
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+ unsigned long iflags;
int oldprio, on_rq, running;
+ int prio = p->pi.prio;
struct rq *rq;
const struct sched_class *prev_class = p->sched_class;

BUG_ON(prio < 0 || prio > MAX_PRIO);

- rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &iflags);

/*
* Idle task boosting is a nono in general. There is one
@@ -4929,6 +5012,10 @@ void task_setprio(struct task_struct *p, int prio)

update_rq_clock(rq);

+ /* If prio is not changing, bail */
+ if (prio == p->prio)
+ goto out_unlock;
+
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -4937,12 +5024,7 @@ void task_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
- p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
-
- p->prio = prio;
+ task_setprio(p, prio);

// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p));

@@ -4956,7 +5038,9 @@ void task_setprio(struct task_struct *p, int prio)
// trace_special(prev_resched, _need_resched(), 0);

out_unlock:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, &iflags);
+
+ return 0;
}

void set_user_nice(struct task_struct *p, long nice)
@@ -4990,9 +5074,9 @@ void set_user_nice(struct task_struct *p, long nice)
}

p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p);
old_prio = p->prio;
- p->prio = effective_prio(p);
+ task_updateprio(p);
+
delta = p->prio - old_prio;

if (on_rq) {
@@ -5007,6 +5091,8 @@ void set_user_nice(struct task_struct *p, long nice)
}
out_unlock:
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}
EXPORT_SYMBOL(set_user_nice);

@@ -5123,23 +5209,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
BUG_ON(p->se.on_rq);

p->policy = policy;
- switch (p->policy) {
- case SCHED_NORMAL:
- case SCHED_BATCH:
- case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
- break;
- case SCHED_FIFO:
- case SCHED_RR:
- p->sched_class = &rt_sched_class;
- break;
- }
-
p->rt_priority = prio;
- p->normal_prio = normal_prio(p);
- /* we are holding p->pi_lock already */
- p->prio = rt_mutex_getprio(p);
- set_load_weight(p);
+
+ task_updateprio(p);
}

/**
@@ -5264,6 +5336,7 @@ recheck:
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);

+ task_pi_update(p, 0);
rt_mutex_adjust_pi(p);

return 0;
@@ -6686,6 +6759,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
deactivate_task(rq, rq->idle, 0);
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
+ rq->idle->prio = rq->idle->normal_prio;
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
@@ -8395,6 +8469,8 @@ void __init sched_init(void)
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
#endif

+ task_pi_init(&init_task);
+
#ifdef CONFIG_RT_MUTEXES
plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
#endif
@@ -8460,7 +8536,9 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
+
__setscheduler(rq, p, SCHED_NORMAL, 0);
+
if (on_rq) {
activate_task(rq, p, 0);
resched_task(rq->curr);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f37979..5cd4b0e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -145,8 +145,13 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
plist_node_init(&work->entry, prio);
plist_add(&work->entry, &cwq->worklist);

- if (boost_prio < cwq->thread->prio)
- task_setprio(cwq->thread, boost_prio);
+ /*
+ * FIXME: We want to boost to boost_prio, but we don't record that
+ * value in the work_struct for later deboosting
+ */
+ pi_source_init(&work->pi_src, &work->entry.prio);
+ task_pi_boost(cwq->thread, &work->pi_src, 0);
+
wake_up(&cwq->more_work);
}

@@ -280,6 +285,10 @@ struct wq_barrier {
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
struct plist_head *worklist = &cwq->worklist;
+ struct pi_source pi_src;
+ int prio;
+
+ pi_source_init(&pi_src, &prio);

spin_lock_irq(&cwq->lock);
cwq->run_depth++;
@@ -292,10 +301,10 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)

again:
while (!plist_head_empty(worklist)) {
- int prio;
struct work_struct *work = plist_first_entry(worklist,
struct work_struct, entry);
work_func_t f = work->func;
+
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct
@@ -316,14 +325,28 @@ again:
}
prio = max(prio, 0);

- if (likely(cwq->thread->prio != prio))
- task_setprio(cwq->thread, prio);
-
cwq->current_work = work;
plist_del(&work->entry, worklist);
plist_node_init(&work->entry, MAX_PRIO);
spin_unlock_irq(&cwq->lock);

+ /*
+ * The owner is free to reuse the work object once we execute
+ * the work->func() below. Therefore we cannot leave the
+ * work->pi_src boosting our thread or it may get stomped
+ * on when the work item is requeued.
+ *
+ * So what we do is boost ourselves with an on-the-stack
+ * copy of the priority of the work item, and then
+ * deboost the workitem. Once the work is complete, we
+ * can then simply deboost the stack version.
+ *
+ * Note that this will not typically cause a pi-chain
+ * update since we are boosting the node laterally
+ */
+ task_pi_boost(current, &pi_src, PI_FLAG_DEFER_UPDATE);
+ task_pi_deboost(current, &work->pi_src, PI_FLAG_DEFER_UPDATE);
+
BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
leak_check(NULL);
@@ -334,6 +357,9 @@ again:
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);
leak_check(f);

+ /* Deboost the stack copy of the work->prio (see above) */
+ task_pi_deboost(current, &pi_src, 0);
+
spin_lock_irq(&cwq->lock);
cwq->current_work = NULL;
wake_up_all(&cwq->work_done);
@@ -357,7 +383,6 @@ again:
goto again;
}

- task_setprio(cwq->thread, current->normal_prio);
cwq->run_depth--;
spin_unlock_irq(&cwq->lock);
}

2008-08-01 21:24:43

by Gregory Haskins

Subject: [PATCH RT RFC 3/7] rtmutex: formally initialize the rt_mutex_waiters

We will be adding more logic to rt_mutex_waiters, and therefore let's
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
}
#endif

+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+ memset(waiter, 0, sizeof(*waiter));
+
+ debug_rt_mutex_init_waiter(waiter);
+ waiter->task = NULL;
+ waiter->write_lock = 0;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct task_struct *orig_owner;
int missed = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long saved_state = -1, state, flags;

+ init_waiter(&waiter);
+
spin_lock_irqsave(&mutex->wait_lock, flags);
init_rw_lists(rwm);

@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)

/* Owner is a writer (or a blocked writer). Block on the lock */

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
-
if (mtx) {
/*
* We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long flags, saved_state = -1, state;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
+ init_waiter(&waiter);

/* we do PI different for writers that are blocked */
waiter.write_lock = 1;
@@ -2270,9 +2274,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
unsigned long flags;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

2008-08-01 21:24:59

by Gregory Haskins

Subject: [PATCH RT RFC 4/7] RT: wrap the rt_rwlock "add reader" logic

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
# define rw_check_held(rwm) do { } while (0)
#endif

+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+ list_add(&rls->list, &rwm->readers);
+}
+
/*
* The fast path does not add itself to the reader list to keep
* from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
if (rls->list.prev && !list_empty(&rls->list))
return;

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);

/* change to reader, so no one else updates too */
rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
* it hasn't been added to the link list yet.
*/
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
rls->count++;
incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rls->lock = rwm;
rls->count = 1;
WARN_ON(rls->list.prev && !list_empty(&rls->list));
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
} else
spin_unlock(&current->pi_lock);
@@ -2083,7 +2089,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)

/* Set us up for multiple readers or conflicts */

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rwm->owner = RT_RW_READER;

/*

2008-08-01 21:25:26

by Gregory Haskins

Subject: [PATCH RT RFC 5/7] rtmutex: use runtime init for rtmutexes

The system already has facilities to perform late/run-time init for
rtmutexes. We want to add more advanced initialization later in the
series, so we force all rtmutexes through the init path in preparation
for the later patches.

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rtmutex.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
- , .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

#define DEFINE_RT_MUTEX(mutexname) \

2008-08-01 21:25:51

by Gregory Haskins

Subject: [PATCH RT RFC 6/7] rtmutex: convert rtmutexes to fully use the PI library

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code. This patch converts the rtmutex PI code to officially
use the PI library.
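
In rough strokes, the lock itself becomes a pi object: lock->pi.prio
tracks the priority being asserted on the lock, lock->pi.src references
that value, and the owner task is boosted/deboosted through that single
source. Below is a condensed, illustrative sketch of the resulting owner
hand-off (the helper name is hypothetical; the real logic lives in
rt_mutex_set_owner() and friends in the diff below):

    static void rtmutex_switch_owner_sketch(struct rt_mutex *lock,
                                            struct task_struct *old_owner,
                                            struct task_struct *new_owner)
    {
            if (old_owner)
                    task_pi_deboost(old_owner, &lock->pi.src,
                                    PI_FLAG_DEFER_UPDATE);
            if (new_owner)
                    task_pi_boost(new_owner, &lock->pi.src,
                                  PI_FLAG_DEFER_UPDATE);

            /* push the deferred changes through the task chain(s) */
            if (old_owner)
                    task_pi_update(old_owner, 0);
            if (new_owner)
                    task_pi_update(new_owner, 0);
    }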

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 15 -
include/linux/sched.h | 21 -
kernel/fork.c | 2
kernel/rcupreempt-boost.c | 2
kernel/rtmutex-debug.c | 4
kernel/rtmutex.c | 944 ++++++++++++++-------------------------------
kernel/rtmutex_common.h | 18 -
kernel/rwlock_torture.c | 32 --
kernel/sched.c | 12 -
10 files changed, 319 insertions(+), 733 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..d0ef0f1 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
#include <asm/atomic.h>
#include <linux/spinlock_types.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>

#ifdef CONFIG_PREEMPT_RT
/*
@@ -67,6 +68,7 @@ struct rw_mutex {
atomic_t count; /* number of times held for read */
atomic_t owners; /* number of owners as readers */
struct list_head readers;
+ struct pi_sink pi_snk;
int prio;
};

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..d984244 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/plist.h>
#include <linux/spinlock_types.h>
+#include <linux/pi.h>

/**
* The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
raw_spinlock_t wait_lock;
struct plist_head wait_list;
struct task_struct *owner;
+ struct {
+ struct pi_source src;
+ struct pi_node node;
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63ddd1f..7af6c3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1106,6 +1106,7 @@ struct reader_lock_struct {
struct rw_mutex *lock;
struct list_head list;
struct task_struct *task;
+ struct pi_source pi_src;
int count;
};

@@ -1307,15 +1308,6 @@ struct task_struct {
int prio;
} pi;

-#ifdef CONFIG_RT_MUTEXES
- /* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
- /* Deadlock detection and priority inheritance handling */
- struct rt_mutex_waiter *pi_blocked_on;
- int rtmutex_prio;
- struct pi_source rtmutex_prio_src;
-#endif
-
#ifdef CONFIG_DEBUG_MUTEXES
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
@@ -1805,17 +1797,6 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-#ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_adjust_pi(struct task_struct *p);
-#else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
- return p->normal_prio;
-}
-# define rt_mutex_adjust_pi(p) do { } while (0)
-#endif
-
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..80ca71d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -887,8 +887,6 @@ static void rt_mutex_init_task(struct task_struct *p)
{
spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters, &p->pi_lock);
- p->pi_blocked_on = NULL;
# ifdef CONFIG_DEBUG_RT_MUTEXES
p->last_kernel_lock = NULL;
# endif
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 9373b9e..f3eeca3 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -431,7 +431,7 @@ void rcu_boost_readers(void)

spin_lock_irqsave(&rcu_boost_wake_lock, flags);

- prio = rt_mutex_getprio(curr);
+ prio = get_rcu_prio(curr);

rcu_trace_boost_try_boost_readers(RCU_BOOST_ME);

diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 0d9cb54..2034ce1 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -57,8 +57,6 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
- DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
#ifdef CONFIG_PREEMPT_RT
WARN_ON(task->reader_lock_count);
#endif
@@ -156,7 +154,6 @@ void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
waiter->deadlock_task_pid = NULL;
}

@@ -164,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
DEBUG_LOCKS_WARN_ON(waiter->task);
memset(waiter, 0x22, sizeof(*waiter));
}
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 62fdc3d..0f64298 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -58,14 +58,32 @@
* state.
*/

+static inline void
+rtmutex_pi_owner(struct rt_mutex *lock, struct task_struct *p, int add)
+{
+ if (!p || p == RT_RW_READER)
+ return;
+
+ if (add)
+ task_pi_boost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+ else
+ task_pi_deboost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+}
+
static void
rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
unsigned long mask)
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock))
+ if (rt_mutex_has_waiters(lock)) {
+ struct task_struct *prev_owner = rt_mutex_owner(lock);
+
+ rtmutex_pi_owner(lock, prev_owner, 0);
+ rtmutex_pi_owner(lock, owner, 1);
+
val |= RT_MUTEX_HAS_WAITERS;
+ }

lock->owner = (struct task_struct *)val;
}
@@ -134,245 +152,88 @@ static inline int task_is_reader(struct task_struct *task) { return 0; }
#endif

int pi_initialized;
-
-/*
- * we initialize the wait_list runtime. (Could be done build-time and/or
- * boot-time.)
- */
-static inline void init_lists(struct rt_mutex *lock)
+static inline int rtmutex_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- if (unlikely(!lock->wait_list.prio_list.prev)) {
- plist_head_init(&lock->wait_list, &lock->wait_lock);
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- pi_initialized++;
-#endif
- }
-}
-
-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio);
-
-/*
- * Calculate task priority from the waiter list priority
- *
- * Return task->normal_prio when the waiter list is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
- int prio = min(task->normal_prio, get_rcu_prio(task));
-
- prio = rt_mutex_get_readers_prio(task, prio);
-
- if (likely(!task_has_pi_waiters(task)))
- return prio;
-
- return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
-}
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);

-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
- int prio = rt_mutex_getprio(task);
-
- if (task->rtmutex_prio != prio) {
- task->rtmutex_prio = prio;
- task_pi_boost(task, &task->rtmutex_prio_src, 0);
- }
-}
-
-/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-static void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ lock->pi.prio = *src->prio;

- spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
+ return 0;
}

-/*
- * Max number of times we'll walk the boosting chain:
- */
-int max_lock_depth = 1024;
-
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth);
-/*
- * Adjust the priority chain. Also used for deadlock detection.
- * Decreases task's usage by one - may thus free the task.
- * Returns 0 or -EDEADLK.
- */
-static int rt_mutex_adjust_prio_chain(struct task_struct *task,
- int deadlock_detect,
- struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- int recursion_depth)
+static inline int rtmutex_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
- struct rt_mutex *lock;
- struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
- int detect_deadlock, ret = 0, depth = 0;
- unsigned long flags;
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);
+ struct task_struct *owner = NULL;
+ unsigned long iflags;

- detect_deadlock = debug_rt_mutex_detect_deadlock(orig_waiter,
- deadlock_detect);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- /*
- * The (de)boosting is a step by step approach with a lot of
- * pitfalls. We want this to be preemptible and we want hold a
- * maximum of two locks per step. So we have to check
- * carefully whether things change under us.
- */
- again:
- if (++depth > max_lock_depth) {
- static int prev_max;
+ if (rt_mutex_has_waiters(lock)) {
+ owner = rt_mutex_owner(lock);

- /*
- * Print this only once. If the admin changes the limit,
- * print a new message when reaching the limit again.
- */
- if (prev_max != max_lock_depth) {
- prev_max = max_lock_depth;
- printk(KERN_WARNING "Maximum lock depth %d reached "
- "task: %s (%d)\n", max_lock_depth,
- top_task->comm, task_pid_nr(top_task));
+ if (owner && owner != RT_RW_READER) {
+ rtmutex_pi_owner(lock, owner, 1);
+ get_task_struct(owner);
}
- put_task_struct(task);
-
- return deadlock_detect ? -EDEADLK : 0;
}
- retry:
- /*
- * Task can not go away as we did a get_task() before !
- */
- spin_lock_irqsave(&task->pi_lock, flags);

- waiter = task->pi_blocked_on;
- /*
- * Check whether the end of the boosting chain has been
- * reached or the state of the chain has changed while we
- * dropped the locks.
- */
- if (!waiter || !waiter->task)
- goto out_unlock_pi;
-
- /*
- * Check the orig_waiter state. After we dropped the locks,
- * the previous owner of the lock might have released the lock
- * and made us the pending owner:
- */
- if (orig_waiter && !orig_waiter->task)
- goto out_unlock_pi;
-
- /*
- * Drop out, when the task has no waiters. Note,
- * top_waiter can be NULL, when we are in the deboosting
- * mode!
- */
- if (top_waiter && (!task_has_pi_waiters(task) ||
- top_waiter != task_top_pi_waiter(task)))
- goto out_unlock_pi;
-
- /*
- * When deadlock detection is off then we check, if further
- * priority adjustment is necessary.
- */
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
- goto out_unlock_pi;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);

- lock = waiter->lock;
- if (!spin_trylock(&lock->wait_lock)) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- cpu_relax();
- goto retry;
+ if (owner && owner != RT_RW_READER) {
+ task_pi_update(owner, 0);
+ put_task_struct(owner);
}

- /* Deadlock detection */
- if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- debug_rt_mutex_deadlock(deadlock_detect, orig_waiter, lock);
- spin_unlock(&lock->wait_lock);
- ret = deadlock_detect ? -EDEADLK : 0;
- goto out_unlock_pi;
- }
+ return 0;
+}

- top_waiter = rt_mutex_top_waiter(lock);
+static struct pi_sink rtmutex_pi_snk = {
+ .boost = rtmutex_pi_boost,
+ .update = rtmutex_pi_update,
+};

- /* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
-
- /* Release the task */
- spin_unlock(&task->pi_lock);
- put_task_struct(task);
+static void init_pi(struct rt_mutex *lock)
+{
+ pi_node_init(&lock->pi.node);

- /* Grab the next task */
- task = rt_mutex_owner(lock);
+ lock->pi.prio = MAX_PRIO;
+ pi_source_init(&lock->pi.src, &lock->pi.prio);
+ lock->pi.snk = rtmutex_pi_snk;

- /*
- * Readers are special. We may need to boost more than one owner.
- */
- if (task_is_reader(task)) {
- ret = rt_mutex_adjust_readers(orig_lock, orig_waiter,
- top_task, lock,
- recursion_depth);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- goto out;
- }
+ pi_add_sink(&lock->pi.node, &lock->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}

- get_task_struct(task);
- spin_lock(&task->pi_lock);
-
- if (waiter == rt_mutex_top_waiter(lock)) {
- /* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
-
- } else if (top_waiter == waiter) {
- /* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
- waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
+/*
+ * we initialize the wait_list runtime. (Could be done build-time and/or
+ * boot-time.)
+ */
+static inline void init_lists(struct rt_mutex *lock)
+{
+ if (unlikely(!lock->wait_list.prio_list.prev)) {
+ plist_head_init(&lock->wait_list, &lock->wait_lock);
+ init_pi(lock);
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ pi_initialized++;
+#endif
}
-
- spin_unlock(&task->pi_lock);
-
- top_waiter = rt_mutex_top_waiter(lock);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- if (!detect_deadlock && waiter != top_waiter)
- goto out_put_task;
-
- goto again;
-
- out_unlock_pi:
- spin_unlock_irqrestore(&task->pi_lock, flags);
- out_put_task:
- put_task_struct(task);
- out:
- return ret;
}

/*
+ * Max number of times we'll walk the boosting chain:
+ */
+int max_lock_depth = 1024;
+
+/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
@@ -380,7 +241,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *next;

if (!rt_mutex_owner_pending(lock))
return 0;
@@ -390,49 +250,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)

WARN_ON(task_is_reader(rt_mutex_owner(lock)));

- spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, mode)) {
- spin_unlock(&pendowner->pi_lock);
- return 0;
- }
-
- /*
- * Check if a waiter is enqueued on the pending owners
- * pi_waiters list. Remove it and readjust pending owners
- * priority.
- */
- if (likely(!rt_mutex_has_waiters(lock))) {
- spin_unlock(&pendowner->pi_lock);
- return 1;
- }
-
- /* No chain handling, pending owner is not blocked on anything: */
- next = rt_mutex_top_waiter(lock);
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
- __rt_mutex_adjust_prio(pendowner);
- spin_unlock(&pendowner->pi_lock);
-
- /*
- * We are going to steal the lock and a waiter was
- * enqueued on the pending owners pi_waiters queue. So
- * we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
- * lock and gets unboosted because the booster is
- * interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
- *
- * Note: in the rare case of a SCHED_OTHER task changing
- * its priority and thus stealing the lock, next->task
- * might be current:
- */
- if (likely(next->task != current)) {
- spin_lock(&current->pi_lock);
- plist_add(&next->pi_list_entry, &current->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock(&current->pi_lock);
- }
- return 1;
+ return lock_is_stealable(pendowner, mode);
}

/*
@@ -486,74 +304,145 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
}

/*
- * Task blocks on lock.
- *
- * Prepare waiter and propagate pi chain
- *
- * This must be called with lock->wait_lock held.
+ * These callbacks are invoked whenever a waiter has changed priority.
+ * So we should requeue it within the lock->wait_list
*/
-static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- int detect_deadlock, unsigned long flags)
+
+static inline int rtmutex_waiter_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct task_struct *owner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *top_waiter = waiter;
- int chain_walk = 0, res;
+ struct rt_mutex_waiter *waiter;

- spin_lock(&current->pi_lock);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
- waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);

- /* Get the top priority waiter on the lock */
- if (rt_mutex_has_waiters(lock))
- top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ /*
+ * We don't need to take any locks here because the
+ * waiter->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ waiter->pi.prio = *src->prio;

- current->pi_blocked_on = waiter;
+ return 0;
+}

- spin_unlock(&current->pi_lock);
+static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
+ unsigned int flags)
+{
+ struct rt_mutex *lock;
+ struct rt_mutex_waiter *waiter;
+ unsigned long iflags;

- if (waiter == rt_mutex_top_waiter(lock)) {
- /* readers are handled differently */
- if (task_is_reader(owner)) {
- res = rt_mutex_adjust_readers(lock, waiter,
- current, lock, 0);
- return res;
- }
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);
+ lock = waiter->lock;

- spin_lock(&owner->pi_lock);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- __rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
- chain_walk = 1;
- spin_unlock(&owner->pi_lock);
+ /*
+ * If waiter->task is non-NULL, it means we are still valid in the
+ * pi list. Therefore, if waiter->pi.prio has changed since we
+ * queued ourselves, requeue it.
+ */
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
}
- else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
- chain_walk = 1;

- if (!chain_walk || task_is_reader(owner))
- return 0;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);
+
+ return 0;
+}
+
+static struct pi_sink rtmutex_waiter_pi_snk = {
+ .boost = rtmutex_waiter_pi_boost,
+ .update = rtmutex_waiter_pi_update,
+};
+
+/*
+ * This must be called with lock->wait_lock held.
+ */
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ int has_waiters = rt_mutex_has_waiters(lock);
+
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+ waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
- * The owner can't disappear while holding a lock,
- * so the owner struct is protected by wait_lock.
- * Gets dropped in rt_mutex_adjust_prio_chain()!
+ * Link the waiter object to the task so that we can adjust our
+ * position on the prio list if the priority is changed. Note
+ * that if the priority races between the time we recorded it
+ * above and the time it is set here, we will correct the race
+ * when we call task_pi_update(current) below. Otherwise the
+ * update is a no-op.
*/
- get_task_struct(owner);
+ pi_add_sink(&current->pi.node, &waiter->pi.snk,
+ PI_FLAG_DEFER_UPDATE);

- spin_unlock_irqrestore(&lock->wait_lock, flags);
+ /*
+ * Link the lock object to the waiter so that we can form a chain
+ * to the owner
+ */
+ pi_add_sink(&current->pi.node, &lock->pi.node.snk,
+ PI_FLAG_DEFER_UPDATE);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current, 0);
+ /*
+ * If we previously had no waiters, we are transitioning to
+ * a mode where we need to boost the owner
+ */
+ if (!has_waiters) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 1);
+ }

- spin_lock_irq(&lock->wait_lock);
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ /*
+ * We can stop boosting the owner if there are no more waiters
+ */
+ if (!rt_mutex_has_waiters(lock)) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 0);
+ }

- return res;
+ /*
+ * Unlink the lock object from the waiter
+ */
+ pi_del_sink(&p->pi.node, &lock->pi.node.snk, PI_FLAG_DEFER_UPDATE);
+
+ /*
+ * Unlink the waiter object from the task. Note that we
+ * technically do not need an update for "p" because the
+ * .deboost will be processed synchronously with this call
+ * since there is no .deboost handler registered for
+ * the waiter sink
+ */
+ pi_del_sink(&p->pi.node, &waiter->pi.snk, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -566,24 +455,10 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
- struct rt_mutex_waiter *waiter;
- struct task_struct *pendowner;
- struct rt_mutex_waiter *next;
-
- spin_lock(&current->pi_lock);
+ struct rt_mutex_waiter *waiter = rt_mutex_top_waiter(lock);
+ struct task_struct *pendowner = waiter->task;

- waiter = rt_mutex_top_waiter(lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
-
- /*
- * Remove it from current->pi_waiters. We do not adjust a
- * possible priority boost right now. We execute wakeup in the
- * boosted mode and go back to normal after releasing
- * lock->wait_lock.
- */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- pendowner = waiter->task;
- waiter->task = NULL;
+ remove_waiter(lock, waiter);

/*
* Do the wakeup before the ownership change to give any spinning
@@ -621,113 +496,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
}

rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
- spin_unlock(&current->pi_lock);
-
- /*
- * Clear the pi_blocked_on variable and enqueue a possible
- * waiter into the pi_waiters list of the pending owner. This
- * prevents that in case the pending owner gets unboosted a
- * waiter with higher priority than pending-owner->normal_prio
- * is blocked on the unboosted (pending) owner.
- */
-
- if (rt_mutex_has_waiters(lock))
- next = rt_mutex_top_waiter(lock);
- else
- next = NULL;
-
- spin_lock(&pendowner->pi_lock);
-
- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
- pendowner->pi_blocked_on = NULL;
-
- if (next)
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
- spin_unlock(&pendowner->pi_lock);
-}
-
-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long flags)
-{
- int first = (waiter == rt_mutex_top_waiter(lock));
- struct task_struct *owner = rt_mutex_owner(lock);
- int chain_walk = 0;
-
- spin_lock(&current->pi_lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
- current->pi_blocked_on = NULL;
- spin_unlock(&current->pi_lock);
-
- if (first && owner != current && !task_is_reader(owner)) {
-
- spin_lock(&owner->pi_lock);
-
- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
-
- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
- }
- __rt_mutex_adjust_prio(owner);
-
- if (owner->pi_blocked_on)
- chain_walk = 1;
-
- spin_unlock(&owner->pi_lock);
- }
-
- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
- if (!chain_walk)
- return;
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(owner);
-
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0);
-
- spin_lock_irq(&lock->wait_lock);
-}
-
-/*
- * Recheck the pi chain, in case we got a priority setting
- *
- * Called from sched_setscheduler
- */
-void rt_mutex_adjust_pi(struct task_struct *task)
-{
- struct rt_mutex_waiter *waiter;
- unsigned long flags;
-
- spin_lock_irqsave(&task->pi_lock, flags);
-
- waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- return;
- }
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
-
- rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0);
}

/*
@@ -869,7 +637,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* but the lock got stolen by an higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(lock, &waiter, 0, flags);
+ add_waiter(lock, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -917,7 +685,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* can end up with a non-NULL waiter.task:
*/
if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);
/*
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up:
@@ -927,6 +695,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unlock:
spin_unlock_irqrestore(&lock->wait_lock, flags);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -954,8 +725,8 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
@@ -1126,6 +897,9 @@ static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
list_add(&rls->list, &rwm->readers);
+
+ pi_source_init(&rls->pi_src, &rwm->prio);
+ task_pi_boost(rls->task, &rls->pi_src, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -1249,21 +1023,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
waiter = rt_mutex_top_waiter(mutex);
if (!lock_is_stealable(waiter->task, mode))
return 0;
- /*
- * The pending reader has PI waiters,
- * but we are taking the lock.
- * Remove the waiters from the pending owner.
- */
- spin_lock(&mtxowner->pi_lock);
- plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters);
- spin_unlock(&mtxowner->pi_lock);
}
- } else if (rt_mutex_has_waiters(mutex)) {
- /* Readers do things differently with respect to PI */
- waiter = rt_mutex_top_waiter(mutex);
- spin_lock(&current->pi_lock);
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
}
/* Readers never own the mutex */
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
@@ -1275,7 +1035,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
if (incr) {
atomic_inc(&rwm->owners);
rw_check_held(rwm);
- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
rls = &current->owned_read_locks[reader_count];
@@ -1285,10 +1045,11 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();
}
rt_mutex_deadlock_account_lock(mutex, current);
atomic_inc(&rwm->count);
+
return 1;
}

@@ -1378,7 +1139,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1417,7 +1178,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

WARN_ON(rt_mutex_owner(mutex) &&
rt_mutex_owner(mutex) != current &&
@@ -1430,6 +1191,9 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -1457,13 +1221,13 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);
local_irq_save(flags);
- spin_lock(&current->pi_lock);
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
current->owned_read_locks[reader_count].lock = rwm;
current->owned_read_locks[reader_count].count = 1;
} else
WARN_ON_ONCE(1);
+
/*
* If this task is no longer the sole owner of the lock
* or someone is blocking, then we need to add the task
@@ -1473,16 +1237,12 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
struct rt_mutex *mutex = &rwm->mutex;
struct reader_lock_struct *rls;

- /* preserve lock order, we only need wait_lock now */
- spin_unlock(&current->pi_lock);
-
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- rt_rwlock_add_reader(rlw, rwm);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
- } else
- spin_unlock(&current->pi_lock);
+ }
local_irq_restore(flags);
return 1;
}
@@ -1591,7 +1351,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1630,7 +1390,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

/* check on unlock if we have any waiters. */
if (rt_mutex_has_waiters(mutex))
@@ -1642,6 +1402,9 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);

}
@@ -1733,7 +1496,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

for (i = current->reader_lock_count - 1; i >= 0; i--) {
if (current->owned_read_locks[i].lock == rwm) {
- spin_lock(&current->pi_lock);
+ preempt_disable();
current->owned_read_locks[i].count--;
if (!current->owned_read_locks[i].count) {
current->reader_lock_count--;
@@ -1743,9 +1506,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
WARN_ON(!rls->list.prev || list_empty(&rls->list));
list_del_init(&rls->list);
rls->lock = NULL;
+ task_pi_deboost(current, &rls->pi_src,
+ PI_FLAG_DEFER_UPDATE);
rw_check_held(rwm);
}
- spin_unlock(&current->pi_lock);
+ preempt_enable();
break;
}
}
@@ -1776,7 +1541,6 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

/* If no one is blocked, then clear all ownership */
if (!rt_mutex_has_waiters(mutex)) {
- rwm->prio = MAX_PRIO;
/*
* If count is not zero, we are under the limit with
* no other readers.
@@ -1835,28 +1599,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
}

- if (rt_mutex_has_waiters(mutex)) {
- waiter = rt_mutex_top_waiter(mutex);
- rwm->prio = waiter->task->prio;
- /*
- * If readers still own this lock, then we need
- * to update the pi_list too. Readers have a separate
- * path in the PI chain.
- */
- if (reader_count) {
- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->pi_list_entry,
- &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
- }
- } else
- rwm->prio = MAX_PRIO;
-
out:
spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -1874,9 +1621,9 @@ rt_read_fastunlock(struct rw_mutex *rwm,
int reader_count;
int owners;

- spin_lock_irqsave(&current->pi_lock, flags);
+ local_irq_save(flags);
reader_count = --current->reader_lock_count;
- spin_unlock_irqrestore(&current->pi_lock, flags);
+ local_irq_restore(flags);

rt_mutex_deadlock_account_unlock(current);
if (unlikely(reader_count < 0)) {
@@ -1972,17 +1719,7 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- spin_lock(&reader->pi_lock);
- waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);
+ remove_waiter(mutex, waiter);

if (savestate)
wake_up_process_mutex(reader);
@@ -1995,32 +1732,12 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- spin_lock(&pendowner->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-
- /* This could also be a reader (if reader_limit is set) */
- if (next->write_lock)
- /* add back in as top waiter */
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- rwm->prio = next->task->prio;
- } else
- rwm->prio = MAX_PRIO;
-
out:

spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -2068,7 +1785,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);

- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
rls = &current->owned_read_locks[reader_count];
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
@@ -2076,12 +1793,11 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
rls->count = 1;
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();

if (!rt_mutex_has_waiters(mutex)) {
/* We are sole owner, we are done */
rwm->owner = current;
- rwm->prio = MAX_PRIO;
mutex->owner = NULL;
spin_unlock_irqrestore(&mutex->wait_lock, flags);
return;
@@ -2102,17 +1818,8 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&current->pi_lock);
plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
-
- spin_lock(&reader->pi_lock);
waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);

/* downgrade is only for mutexes */
wake_up_process(reader);
@@ -2123,124 +1830,81 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- /* setup this mutex prio for read */
- rwm->prio = next->task->prio;
-
- spin_lock(&current->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
- /* No need to add back since readers don't have PI waiters */
- } else
- rwm->prio = MAX_PRIO;
-
rt_mutex_set_owner(mutex, RT_RW_READER, 0);

spin_unlock_irqrestore(&mutex->wait_lock, flags);
-
- /*
- * Undo pi boosting when necessary.
- * If one of the awoken readers boosted us, we don't want to keep
- * that priority.
- */
- rt_mutex_adjust_prio(current);
-}
-
-void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
-{
- struct rt_mutex *mutex = &rwm->mutex;
-
- rwm->owner = NULL;
- atomic_set(&rwm->count, 0);
- atomic_set(&rwm->owners, 0);
- rwm->prio = MAX_PRIO;
- INIT_LIST_HEAD(&rwm->readers);
-
- __rt_mutex_init(mutex, name);
}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+/*
+ * These callbacks are invoked whenever a rwlock has changed priority.
+ * Since rwlocks maintain their own lists of reader dependencies, we
+ * may need to reboost any readers manually
+ */
+static inline int rt_rwlock_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct reader_lock_struct *rls;
struct rw_mutex *rwm;
- int lock_prio;
- int i;

- for (i = 0; i < task->reader_lock_count; i++) {
- rls = &task->owned_read_locks[i];
- rwm = rls->lock;
- if (rwm) {
- lock_prio = rwm->prio;
- if (prio > lock_prio)
- prio = lock_prio;
- }
- }
+ rwm = container_of(snk, struct rw_mutex, pi_snk);

- return prio;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ rwm->prio = *src->prio;
+
+ return 0;
}

-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
+static inline int rt_rwlock_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
+ struct rw_mutex *rwm;
+ struct rt_mutex *mutex;
struct reader_lock_struct *rls;
- struct rt_mutex_waiter *waiter;
- struct task_struct *task;
- struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex);
+ unsigned long iflags;

- if (rt_mutex_has_waiters(lock)) {
- waiter = rt_mutex_top_waiter(lock);
- /*
- * Do we need to grab the task->pi_lock?
- * Really, we are only reading it. If it
- * changes, then that should follow this chain
- * too.
- */
- rwm->prio = waiter->task->prio;
- } else
- rwm->prio = MAX_PRIO;
+ rwm = container_of(snk, struct rw_mutex, pi_snk);
+ mutex = &rwm->mutex;

- if (recursion_depth >= MAX_RWLOCK_DEPTH) {
- WARN_ON(1);
- return 1;
- }
+ spin_lock_irqsave(&mutex->wait_lock, iflags);

- list_for_each_entry(rls, &rwm->readers, list) {
- task = rls->task;
- get_task_struct(task);
- /*
- * rt_mutex_adjust_prio_chain will do
- * the put_task_struct
- */
- rt_mutex_adjust_prio_chain(task, 0, orig_lock,
- orig_waiter, top_task,
- recursion_depth+1);
- }
+ list_for_each_entry(rls, &rwm->readers, list)
+ task_pi_boost(rls->task, &rls->pi_src, 0);
+
+ spin_unlock_irqrestore(&mutex->wait_lock, iflags);

return 0;
}
-#else
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
-{
- return 0;
-}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+static struct pi_sink rt_rwlock_pi_snk = {
+ .boost = rt_rwlock_pi_boost,
+ .update = rt_rwlock_pi_update,
+};
+
+void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
{
- return prio;
+ struct rt_mutex *mutex = &rwm->mutex;
+
+ rwm->owner = NULL;
+ atomic_set(&rwm->count, 0);
+ atomic_set(&rwm->owners, 0);
+ rwm->prio = MAX_PRIO;
+ INIT_LIST_HEAD(&rwm->readers);
+
+ __rt_mutex_init(mutex, name);
+
+ /*
+ * Link the rwlock object to the mutex so we get notified
+ * of any priority changes in the future
+ */
+ rwm->pi_snk = rt_rwlock_pi_snk;
+ pi_add_sink(&mutex->pi.node, &rwm->pi_snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
}
+
#endif /* CONFIG_PREEMPT_RT */

static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags)
@@ -2335,8 +1999,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock, flags);
+ ret = add_waiter(lock, &waiter, &flags);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -2374,7 +2037,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -2388,13 +2051,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(timeout))
hrtimer_cancel(&timeout->timer);

- /*
- * Readjust priority, when we did not get the lock. We might
- * have been the pending owner and boosted. Since we did not
- * take the lock, the PI boost has to go.
- */
- if (unlikely(ret))
- rt_mutex_adjust_prio(current);
+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);

/* Must we reaquire the BKL? */
if (unlikely(saved_lock_depth >= 0))
@@ -2457,8 +2115,8 @@ rt_mutex_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

/*
@@ -2654,6 +2312,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
spin_lock_init(&lock->wait_lock);
plist_head_init(&lock->wait_list, &lock->wait_lock);

+ init_pi(lock);
+
debug_rt_mutex_init(lock, name);
}
EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 70df5f5..7bf32d0 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -14,6 +14,7 @@

#include <linux/rtmutex.h>
#include <linux/rt_lock.h>
+#include <linux/pi.h>

/*
* The rtmutex in kernel tester is independent of rtmutex debugging. We
@@ -48,10 +49,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
*/
struct rt_mutex_waiter {
struct plist_node list_entry;
- struct plist_node pi_list_entry;
struct task_struct *task;
struct rt_mutex *lock;
int write_lock;
+ struct {
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
struct pid *deadlock_task_pid;
@@ -79,18 +83,6 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
return w;
}

-static inline int task_has_pi_waiters(struct task_struct *p)
-{
- return !plist_head_empty(&p->pi_waiters);
-}
-
-static inline struct rt_mutex_waiter *
-task_top_pi_waiter(struct task_struct *p)
-{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
-}
-
/*
* lock->owner state tracking:
*/
diff --git a/kernel/rwlock_torture.c b/kernel/rwlock_torture.c
index 2820815..689a0d0 100644
--- a/kernel/rwlock_torture.c
+++ b/kernel/rwlock_torture.c
@@ -682,37 +682,7 @@ static int __init mutex_stress_init(void)

print_owned_read_locks(tsks[i]);

- if (tsks[i]->pi_blocked_on) {
- w = (void *)tsks[i]->pi_blocked_on;
- mtx = w->lock;
- spin_unlock_irq(&tsks[i]->pi_lock);
- spin_lock_irq(&mtx->wait_lock);
- spin_lock(&tsks[i]->pi_lock);
- own = (unsigned long)mtx->owner & ~3UL;
- oops_in_progress++;
- printk("%s:%d is blocked on ",
- tsks[i]->comm, tsks[i]->pid);
- __print_symbol("%s", (unsigned long)mtx);
- if (own == 0x100)
- printk(" owner is READER\n");
- else if (!(own & ~300))
- printk(" owner is ILLEGAL!!\n");
- else if (!own)
- printk(" has no owner!\n");
- else {
- struct task_struct *owner = (void*)own;
-
- printk(" owner is %s:%d\n",
- owner->comm, owner->pid);
- }
- oops_in_progress--;
-
- spin_unlock(&tsks[i]->pi_lock);
- spin_unlock_irq(&mtx->wait_lock);
- } else {
- print_owned_read_locks(tsks[i]);
- spin_unlock_irq(&tsks[i]->pi_lock);
- }
+ spin_unlock_irq(&tsks[i]->pi_lock);
}
}
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index c129b10..363fc86 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2390,12 +2390,6 @@ task_pi_init(struct task_struct *p)
pi_source_init(&p->pi.src, &p->normal_prio);
task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);

-#ifdef CONFIG_RT_MUTEXES
- p->rtmutex_prio = MAX_PRIO;
- pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
- task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
-#endif
-
/*
* We add our own task as a dependency of ourselves so that
* we get boost-notifications (via task_pi_boost_cb) whenever
@@ -5006,7 +5000,6 @@ task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
goto out_unlock;
}

@@ -5337,7 +5330,6 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);

task_pi_update(p, 0);
- rt_mutex_adjust_pi(p);

return 0;
}
@@ -8471,10 +8463,6 @@ void __init sched_init(void)

task_pi_init(&init_task);

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2008-08-01 21:26:29

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible

Adaptive-locking technology often acquires the lock by
spinning on a running owner instead of sleeping. It is unnecessary
to go through pi-boosting if the owner is of equal or (logically)
lower priority. Therefore, we can save significant overhead
by deferring the boost until it is absolutely necessary. This has
been shown to improve overall performance in PREEMPT_RT.
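
Put another way, the whole optimization hinges on one priority
comparison made before any pi machinery is engaged. A minimal
sketch of that test (the helper name is hypothetical and not part
of the patch; lower numeric ->prio means higher priority):

        static inline int waiter_should_boost(struct task_struct *waiter,
                                              struct task_struct *owner)
        {
                /* boost only if the spinning waiter outranks the owner */
                return waiter->prio < owner->prio;
        }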

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner is >= to current

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>
---

include/linux/rtmutex.h | 1
kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
kernel/rtmutex_common.h | 1
3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d984244..1d98107 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
struct pi_node node;
struct pi_sink snk;
int prio;
+ int boosters;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0f64298..de213ac 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
struct task_struct *prev_owner = rt_mutex_owner(lock);

rtmutex_pi_owner(lock, prev_owner, 0);
rtmutex_pi_owner(lock, owner, 1);
+ }

+ if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;
- }

lock->owner = (struct task_struct *)val;
}
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,

spin_lock_irqsave(&lock->wait_lock, iflags);

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
owner = rt_mutex_owner(lock);

if (owner && owner != RT_RW_READER) {
@@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
pi_node_init(&lock->pi.node);

lock->pi.prio = MAX_PRIO;
+ lock->pi.boosters = 0;
pi_source_init(&lock->pi.src, &lock->pi.prio);
lock->pi.snk = rtmutex_pi_snk;

@@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
}

+static inline void requeue_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ BUG_ON(!waiter->task);
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
/*
* These callbacks are invoked whenever a waiter has changed priority.
* So we should requeue it within the lock->wait_list
@@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
* pi list. Therefore, if waiter->pi.prio has changed since we
* queued ourselves, requeue it.
*/
- if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
- plist_del(&waiter->list_entry, &lock->wait_list);
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
- }
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
+ requeue_waiter(lock, waiter);

spin_unlock_irqrestore(&lock->wait_lock, iflags);

@@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
.update = rtmutex_waiter_pi_update,
};

-/*
- * This must be called with lock->wait_lock held.
- */
-static int add_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long *flags)
+static void boost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
{
- int has_waiters = rt_mutex_has_waiters(lock);
-
- waiter->task = current;
- waiter->lock = lock;
- waiter->pi.prio = current->prio;
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
@@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
* If we previously had no waiters, we are transitioning to
* a mode where we need to boost the owner
*/
- if (!has_waiters) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 1);
}

- spin_unlock_irqrestore(&lock->wait_lock, *flags);
- task_pi_update(current, 0);
- spin_lock_irqsave(&lock->wait_lock, *flags);
-
- return 0;
+ lock->pi.boosters++;
+ waiter->pi.boosted = 1;
}

-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter)
+static void deboost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *p)
{
- struct task_struct *p = waiter->task;
+ BUG_ON(!waiter->pi.boosted);

- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
+ waiter->pi.boosted = 0;
+ lock->pi.boosters--;

/*
* We can stop boosting the owner if there are no more waiters
*/
- if (!rt_mutex_has_waiters(lock)) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 0);
}
@@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
}

/*
+ * This must be called with lock->wait_lock held.
+ */
+static void _add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ _add_waiter(lock, waiter);
+
+ boost_lock(lock, waiter);
+
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ if (waiter->pi.boosted)
+ deboost_lock(lock, waiter, p);
+}
+
+/*
* Wake up the next waiter on the lock.
*
* Remove the top waiter from the current tasks waiter list and from
@@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
if (orig_owner != rt_mutex_owner(waiter->lock))
return 0;

+ /* Special handling for when we are not in pi-boost mode */
+ if (!waiter->pi.boosted) {
+ /*
+ * Are we higher priority than the owner? If so
+ * we should bail out immediately so that we can
+ * pi boost them.
+ */
+ if (current->prio < orig_owner->prio)
+ return 0;
+
+ /*
+ * Did our priority change? If so, we need to
+ * requeue our position in the list
+ */
+ if (waiter->pi.prio != current->prio)
+ return 0;
+ }
+
/* Owner went to bed, so should we */
if (!task_is_current(orig_owner))
return 1;
@@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unsigned long saved_state, state, flags;
struct task_struct *orig_owner;
int missed = 0;
+ int boosted = 0;

init_waiter(&waiter);

@@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
}
missed = 1;

+ orig_owner = rt_mutex_owner(lock);
+
/*
* waiter.task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by an higher prio task.
*/
- if (!waiter.task) {
- add_waiter(lock, &waiter, &flags);
+ if (!waiter.task)
+ _add_waiter(lock, &waiter);
+
+ /*
+ * We only need to pi-boost the owner if they are lower
+ * priority than us. We dont care if this is racy
+ * against priority changes as we will break out of
+ * the adaptive spin anytime any priority changes occur
+ * without boosting enabled.
+ */
+ if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
}

/*
+ * If we are not currently pi-boosting the lock, we have to
+ * monitor whether our priority changed since the last
+ * time it was recorded and requeue ourselves if it moves.
+ */
+ if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
+ waiter.pi.prio = current->prio;
+
+ requeue_waiter(lock, &waiter);
+ }
+
+ /*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
*/
saved_flags = current->flags & PF_NOSCHED;
current->lock_depth = -1;
current->flags &= ~PF_NOSCHED;
- orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
spin_unlock_irqrestore(&lock->wait_lock, flags);

@@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* barrier which we rely upon to ensure current->state
* is visible before we test waiter.task.
*/
+ if (waiter.task && !waiter.pi.boosted) {
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
+ /*
+ * We get here if we have not yet boosted
+ * the lock, yet we are going to sleep. If
+ * we are still pending (waiter.task != 0),
+ * then go ahead and boost them now
+ */
+ if (waiter.task) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+ }
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ }
+
if (waiter.task)
schedule_rt_mutex(lock);
} else
@@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_unlock_irqrestore(&lock->wait_lock, flags);

/* Undo any pi boosting, if necessary */
- task_pi_update(current, 0);
+ if (boosted)
+ task_pi_update(current, 0);

debug_rt_mutex_free_waiter(&waiter);
}
@@ -708,6 +810,7 @@ static void noinline __sched
rt_spin_lock_slowunlock(struct rt_mutex *lock)
{
unsigned long flags;
+ int deboost = 0;

spin_lock_irqsave(&lock->wait_lock, flags);

@@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
return;
}

+ if (lock->pi.boosters)
+ deboost = 1;
+
wakeup_next_waiter(lock, 1);

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting when necessary */
- task_pi_update(current, 0);
+ if (deboost)
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 7bf32d0..34e2381 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -55,6 +55,7 @@ struct rt_mutex_waiter {
struct {
struct pi_sink snk;
int prio;
+ int boosted;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;

2008-08-04 13:23:36

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible

Gregory Haskins wrote:
> Adaptive-locking technology often acquires the lock by
> spinning on a running owner instead of sleeping. It is unnecessary
> to go through pi-boosting if the owner is of equal or (logically)
> lower priority. Therefore, we can save significant overhead
> by deferring the boost until it is absolutely necessary. This has
> been shown to improve overall performance in PREEMPT_RT.
>
> Special thanks to Peter Morreale for suggesting the optimization to
> only consider skipping the boost if the owner is >= to current
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Peter Morreale <[email protected]>
>

I received feedback that this prologue was too vague to accurately
describe what this patch does and why this optimization is not
broken. Therefore, here is the new prologue:

-----------------------
From: Gregory Haskins <[email protected]>

rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance. An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart. However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, while remaining fully preemptible.
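
(For reference, the fastpath in question is essentially a single
atomic compare-and-exchange on the owner field. The sketch below is
illustrative only; rt_mutex_cmpxchg() is the existing helper in
kernel/rtmutex.c, while the wrapper name here is hypothetical.)

        static inline int rt_mutex_try_fastpath(struct rt_mutex *lock)
        {
                /* NULL owner: take it; otherwise fall into the slowpath */
                return rt_mutex_cmpxchg(lock, NULL, current);
        }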

Instrumentation shows that the majority of acquisitions under most
workloads fall either into the fastpath category or the adaptive-spin
category within the slowpath. The need to pi-boost a lock owner
should be sufficiently rare, yet the slowpath blindly incurs this
overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher-order constraint than throughput, so the full
pi-protocol is still observed, via new, carefully constructed rules
around the old concepts:

1) We check the priority of the owner relative to the waiter on
each spin of the lock (if we are not boosted already). If the
owner's effective priority is logically less than the waiter's
priority, we must boost them.

2) We check our own priority against our current queue
position on the waiters-list (if we are not boosted already).
If our priority has changed, we need to re-queue ourselves to
update our position.

3) We break out of the adaptive-spin if either of the above
conditions (1) or (2) changes, so that we can re-evaluate the
lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
voluntarily preempt, since we then lose the ability to
dynamically process the conditions above (see the sketch below).
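
A condensed sketch of how rules (1) and (2) surface in the
adaptive-spin check (folded into one hypothetical helper here for
readability; the real logic lives in the adaptive_wait() and
rt_spin_lock_slowlock() hunks below):

        /*
         * Nonzero means a spinning, not-yet-boosted waiter must stop
         * spinning so the slowpath can boost the owner (rule 1) or
         * requeue the waiter (rule 2).  Lower ->prio == higher priority.
         */
        static inline int spin_needs_break(struct rt_mutex_waiter *waiter,
                                           struct task_struct *owner)
        {
                if (waiter->pi.boosted)
                        return 0;       /* already boosting: keep spinning */

                if (current->prio < owner->prio)        /* rule (1) */
                        return 1;

                if (waiter->pi.prio != current->prio)   /* rule (2) */
                        return 1;

                return 0;
        }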

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a proactive protocol
(boost on entry) versus a reactive protocol (boost when
necessary). The upside to the latter is that we don't take a
penalty for pi when it is not necessary. The downside is that we
technically leave the owner exposed to getting preempted, even if
our waiter is the highest priority task in the system. When this
happens, the owner would be immediately boosted (because we would
hit the "oncpu" condition, and subsequently follow the voluntary
preempt path which boosts the owner). Therefore, inversion is
fully prevented, but we have the extra latency of the
preempt/boost/wakeup that could have been avoided in the proactive
model.
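
Rule (4) is the safety net for exactly that window. Condensed from
the rt_spin_lock_slowlock() hunk below (loop and error handling
trimmed):

        /*
         * About to voluntarily sleep without having boosted yet: boost
         * now, under wait_lock, so that no inversion can persist while
         * we are off the cpu.
         */
        if (waiter.task && !waiter.pi.boosted) {
                spin_lock_irqsave(&lock->wait_lock, flags);
                if (waiter.task)        /* still a pending waiter? */
                        boost_lock(lock, &waiter);
                spin_unlock_irqrestore(&lock->wait_lock, flags);
                task_pi_update(current, 0);
        }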

However, the design of the algorithm described above confines this
phenomenon to setscheduler()
operations. Since rt-locks do not support being interrupted by
signals or timeouts, waiters only depart via the acquisition path.
And while acquisitions do deboost the owner, the owner also
changes simultaneously, rendering the deboost moot relative to the
other waiters.

What this all means is that the downside to this implementation is
that a high-priority waiter *may* see an extra latency (equivalent
to roughly two wake-ups) if the owner has its priority reduced via
setscheduler() while it holds the lock. The penalty is
deterministic, arguably small enough, and sufficiently rare that I
do not believe it should be an issue.

Note: If other exit paths are ever introduced in the
future, simply adapting the condition to look at owner->normal_prio
instead of owner->prio should once again constrain the limitation
to setscheduler().

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner is >= to current.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>


> ---
>
> include/linux/rtmutex.h | 1
> kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
> kernel/rtmutex_common.h | 1
> 3 files changed, 153 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
> index d984244..1d98107 100644
> --- a/include/linux/rtmutex.h
> +++ b/include/linux/rtmutex.h
> @@ -33,6 +33,7 @@ struct rt_mutex {
> struct pi_node node;
> struct pi_sink snk;
> int prio;
> + int boosters;
> } pi;
> #ifdef CONFIG_DEBUG_RT_MUTEXES
> int save_state;
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 0f64298..de213ac 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
> {
> unsigned long val = (unsigned long)owner | mask;
>
> - if (rt_mutex_has_waiters(lock)) {
> + if (lock->pi.boosters) {
> struct task_struct *prev_owner = rt_mutex_owner(lock);
>
> rtmutex_pi_owner(lock, prev_owner, 0);
> rtmutex_pi_owner(lock, owner, 1);
> + }
>
> + if (rt_mutex_has_waiters(lock))
> val |= RT_MUTEX_HAS_WAITERS;
> - }
>
> lock->owner = (struct task_struct *)val;
> }
> @@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,
>
> spin_lock_irqsave(&lock->wait_lock, iflags);
>
> - if (rt_mutex_has_waiters(lock)) {
> + if (lock->pi.boosters) {
> owner = rt_mutex_owner(lock);
>
> if (owner && owner != RT_RW_READER) {
> @@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
> pi_node_init(&lock->pi.node);
>
> lock->pi.prio = MAX_PRIO;
> + lock->pi.boosters = 0;
> pi_source_init(&lock->pi.src, &lock->pi.prio);
> lock->pi.snk = rtmutex_pi_snk;
>
> @@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
> return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
> }
>
> +static inline void requeue_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> +{
> + BUG_ON(!waiter->task);
> +
> + plist_del(&waiter->list_entry, &lock->wait_list);
> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
> + plist_add(&waiter->list_entry, &lock->wait_list);
> +}
> +
> /*
> * These callbacks are invoked whenever a waiter has changed priority.
> * So we should requeue it within the lock->wait_list
> @@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
> * pi list. Therefore, if waiter->pi.prio has changed since we
> * queued ourselves, requeue it.
> */
> - if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
> - plist_del(&waiter->list_entry, &lock->wait_list);
> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
> - plist_add(&waiter->list_entry, &lock->wait_list);
> - }
> + if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
> + requeue_waiter(lock, waiter);
>
> spin_unlock_irqrestore(&lock->wait_lock, iflags);
>
> @@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
> .update = rtmutex_waiter_pi_update,
> };
>
> -/*
> - * This must be called with lock->wait_lock held.
> - */
> -static int add_waiter(struct rt_mutex *lock,
> - struct rt_mutex_waiter *waiter,
> - unsigned long *flags)
> +static void boost_lock(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> {
> - int has_waiters = rt_mutex_has_waiters(lock);
> -
> - waiter->task = current;
> - waiter->lock = lock;
> - waiter->pi.prio = current->prio;
> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
> - plist_add(&waiter->list_entry, &lock->wait_list);
> waiter->pi.snk = rtmutex_waiter_pi_snk;
>
> /*
> @@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
> * If we previously had no waiters, we are transitioning to
> * a mode where we need to boost the owner
> */
> - if (!has_waiters) {
> + if (!lock->pi.boosters) {
> struct task_struct *owner = rt_mutex_owner(lock);
> rtmutex_pi_owner(lock, owner, 1);
> }
>
> - spin_unlock_irqrestore(&lock->wait_lock, *flags);
> - task_pi_update(current, 0);
> - spin_lock_irqsave(&lock->wait_lock, *flags);
> -
> - return 0;
> + lock->pi.boosters++;
> + waiter->pi.boosted = 1;
> }
>
> -/*
> - * Remove a waiter from a lock
> - *
> - * Must be called with lock->wait_lock held
> - */
> -static void remove_waiter(struct rt_mutex *lock,
> - struct rt_mutex_waiter *waiter)
> +static void deboost_lock(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter,
> + struct task_struct *p)
> {
> - struct task_struct *p = waiter->task;
> + BUG_ON(!waiter->pi.boosted);
>
> - plist_del(&waiter->list_entry, &lock->wait_list);
> - waiter->task = NULL;
> + waiter->pi.boosted = 0;
> + lock->pi.boosters--;
>
> /*
> * We can stop boosting the owner if there are no more waiters
> */
> - if (!rt_mutex_has_waiters(lock)) {
> + if (!lock->pi.boosters) {
> struct task_struct *owner = rt_mutex_owner(lock);
> rtmutex_pi_owner(lock, owner, 0);
> }
> @@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
> }
>
> /*
> + * This must be called with lock->wait_lock held.
> + */
> +static void _add_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> +{
> + waiter->task = current;
> + waiter->lock = lock;
> + waiter->pi.prio = current->prio;
> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
> + plist_add(&waiter->list_entry, &lock->wait_list);
> +}
> +
> +static int add_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter,
> + unsigned long *flags)
> +{
> + _add_waiter(lock, waiter);
> +
> + boost_lock(lock, waiter);
> +
> + spin_unlock_irqrestore(&lock->wait_lock, *flags);
> + task_pi_update(current, 0);
> + spin_lock_irqsave(&lock->wait_lock, *flags);
> +
> + return 0;
> +}
> +
> +/*
> + * Remove a waiter from a lock
> + *
> + * Must be called with lock->wait_lock held
> + */
> +static void remove_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> +{
> + struct task_struct *p = waiter->task;
> +
> + plist_del(&waiter->list_entry, &lock->wait_list);
> + waiter->task = NULL;
> +
> + if (waiter->pi.boosted)
> + deboost_lock(lock, waiter, p);
> +}
> +
> +/*
> * Wake up the next waiter on the lock.
> *
> * Remove the top waiter from the current tasks waiter list and from
> @@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
> if (orig_owner != rt_mutex_owner(waiter->lock))
> return 0;
>
> + /* Special handling for when we are not in pi-boost mode */
> + if (!waiter->pi.boosted) {
> + /*
> + * Are we higher priority than the owner? If so
> + * we should bail out immediately so that we can
> + * pi boost them.
> + */
> + if (current->prio < orig_owner->prio)
> + return 0;
> +
> + /*
> + * Did our priority change? If so, we need to
> + * requeue our position in the list
> + */
> + if (waiter->pi.prio != current->prio)
> + return 0;
> + }
> +
> /* Owner went to bed, so should we */
> if (!task_is_current(orig_owner))
> return 1;
> @@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> unsigned long saved_state, state, flags;
> struct task_struct *orig_owner;
> int missed = 0;
> + int boosted = 0;
>
> init_waiter(&waiter);
>
> @@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> }
> missed = 1;
>
> + orig_owner = rt_mutex_owner(lock);
> +
> /*
> * waiter.task is NULL the first time we come here and
> * when we have been woken up by the previous owner
> * but the lock got stolen by an higher prio task.
> */
> - if (!waiter.task) {
> - add_waiter(lock, &waiter, &flags);
> + if (!waiter.task)
> + _add_waiter(lock, &waiter);
> +
> + /*
> + * We only need to pi-boost the owner if they are lower
> + * priority than us. We dont care if this is racy
> + * against priority changes as we will break out of
> + * the adaptive spin anytime any priority changes occur
> + * without boosting enabled.
> + */
> + if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
> + boost_lock(lock, &waiter);
> + boosted = 1;
> +
> + spin_unlock_irqrestore(&lock->wait_lock, flags);
> + task_pi_update(current, 0);
> + spin_lock_irqsave(&lock->wait_lock, flags);
> +
> /* Wakeup during boost ? */
> if (unlikely(!waiter.task))
> continue;
> }
>
> /*
> + * If we are not currently pi-boosting the lock, we have to
> + * monitor whether our priority changed since the last
> + * time it was recorded and requeue ourselves if it moves.
> + */
> + if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
> + waiter.pi.prio = current->prio;
> +
> + requeue_waiter(lock, &waiter);
> + }
> +
> + /*
> * Prevent schedule() to drop BKL, while waiting for
> * the lock ! We restore lock_depth when we come back.
> */
> saved_flags = current->flags & PF_NOSCHED;
> current->lock_depth = -1;
> current->flags &= ~PF_NOSCHED;
> - orig_owner = rt_mutex_owner(lock);
> get_task_struct(orig_owner);
> spin_unlock_irqrestore(&lock->wait_lock, flags);
>
> @@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> * barrier which we rely upon to ensure current->state
> * is visible before we test waiter.task.
> */
> + if (waiter.task && !waiter.pi.boosted) {
> + spin_lock_irqsave(&lock->wait_lock, flags);
> +
> + /*
> + * We get here if we have not yet boosted
> + * the lock, yet we are going to sleep. If
> + * we are still pending (waiter.task != 0),
> + * then go ahead and boost them now
> + */
> + if (waiter.task) {
> + boost_lock(lock, &waiter);
> + boosted = 1;
> + }
> +
> + spin_unlock_irqrestore(&lock->wait_lock, flags);
> + task_pi_update(current, 0);
> + }
> +
> if (waiter.task)
> schedule_rt_mutex(lock);
> } else
> @@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> spin_unlock_irqrestore(&lock->wait_lock, flags);
>
> /* Undo any pi boosting, if necessary */
> - task_pi_update(current, 0);
> + if (boosted)
> + task_pi_update(current, 0);
>
> debug_rt_mutex_free_waiter(&waiter);
> }
> @@ -708,6 +810,7 @@ static void noinline __sched
> rt_spin_lock_slowunlock(struct rt_mutex *lock)
> {
> unsigned long flags;
> + int deboost = 0;
>
> spin_lock_irqsave(&lock->wait_lock, flags);
>
> @@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
> return;
> }
>
> + if (lock->pi.boosters)
> + deboost = 1;
> +
> wakeup_next_waiter(lock, 1);
>
> spin_unlock_irqrestore(&lock->wait_lock, flags);
>
> - /* Undo pi boosting when necessary */
> - task_pi_update(current, 0);
> + if (deboost)
> + /* Undo pi boosting when necessary */
> + task_pi_update(current, 0);
> }
>
> void __lockfunc rt_spin_lock(spinlock_t *lock)
> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
> index 7bf32d0..34e2381 100644
> --- a/kernel/rtmutex_common.h
> +++ b/kernel/rtmutex_common.h
> @@ -55,6 +55,7 @@ struct rt_mutex_waiter {
> struct {
> struct pi_sink snk;
> int prio;
> + int boosted;
> } pi;
> #ifdef CONFIG_DEBUG_RT_MUTEXES
> unsigned long ip;
>




2008-08-05 03:03:48

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible

Gregory Haskins wrote:
> Gregory Haskins wrote:
>> Adaptive-locking technology oftentimes acquires the lock by
>> spinning on a running owner instead of sleeping. It is unnecessary
>> to go through pi-boosting if the owner is of equal or (logically)
>> lower priority. Therefore, we can save some significant overhead
>> by deferring the boost until absolutely necessary. This has been
>> shown to improve overall performance in PREEMPT_RT.
>>
>> Special thanks to Peter Morreale for suggesting the optimization to
>> only consider skipping the boost if the owner is >= to current
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Peter Morreale <[email protected]>
>>
>
> I received feedback that this prologue was too vague to accurately
> describe what this patch does and why it is not broken to use this
> optimization. Therefore, here is the new prologue:
>
> -----------------------
> From: Gregory Haskins <[email protected]>
>
> rtmutex: pi-boost locks as late as possible
>
> PREEMPT_RT replaces most spinlock_t instances with a preemptible
> real-time lock that supports priority inheritance. An uncontended
> (fastpath) acquisition of this lock has no more overhead than
> its non-rt spinlock_t counterpart. However, the contended case
> has considerably more overhead so that the lock can maintain
> proper priority queue order and support pi-boosting of the lock
> owner, while remaining fully preemptible.
>
> Instrumentation shows that the majority of acquisitions under most
> workloads falls either into the fastpath category or the adaptive
> spin category within the slowpath. The necessity to pi-boost a
> lock-owner should be sufficiently rare, yet the slow path
> blindly incurs this overhead in 100% of contentions.
>
> Therefore, this patch intends to capitalize on this observation
> in order to reduce overhead and improve acquisition throughput.
> It is important to note that real-time latency is still treated
> as a higher order constraint than throughput, so the full
> pi-protocol is observed using new carefully constructed rules
> around the old concepts.
>
> 1) We check the priority of the owner relative to the waiter on
> each spin of the lock (if we are not boosted already). If the
> owner's effective priority is logically less than the waiter's
> priority, we must boost them.
>
> 2) We check our own priority against our current queue
> position on the waiters-list (if we are not boosted already).
> If our priority was changed, we need to re-queue ourselves to
> update our position.
>
> 3) We break out of the adaptive-spin if either of the above
> conditions (1) or (2) changes, so that we can re-evaluate the
> lock conditions.
>
> 4) We must enter pi-boost mode if, at any time, we decide to
> voluntarily preempt since we are losing our ability to
> dynamically process the conditions above.
>
> Note: We still fully support priority inheritance with this
> protocol, even if we defer the low-level calls to adjust priority.
> The difference is really in terms of being a pro-active protocol
> (boost on entry) versus a reactive protocol (boost when
> necessary). The upside to the latter is that we don't take a
> penalty for pi when it is not necessary. The downside is that we
> technically leave the owner exposed to getting preempted, even if
> our waiter is the highest priority task in the system.

David Holmes (CC'd) pointed out that this statement is a little vague
and confusing as well. The question is: how could the owner be exposed
to preemption, since it would presumably be running at or above the
waiter's priority, or we would have boosted it already? The answer is
that the owner may have its priority lowered after we have already made
the decision to defer boosting.

Therefore, my updated statement should read:

"The downside is that we technically leave the owner exposed to getting
preempted *should it get asynchronously deprioritized*, even if ...."
As I go on to explain below, this deboosting could only happen as the
result of a setscheduler() call, which I assert should not be cause for
concern. However, I wanted to highlight this phenomenon in the interest
of full disclosure since it is technically a difference in behavior from
the original algorithm. I will update the header with this edit for
clarity.
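
To make that concrete, the per-spin checks from the adaptive_wait() hunk
quoted above reduce to roughly the following (condensed excerpt, not a
standalone function; an asynchronous setscheduler() on the owner is only
caught on the next spin iteration, or at the voluntary-preempt boost):

	if (!waiter->pi.boosted) {
		/*
		 * (1) The owner is now logically lower priority than us:
		 * bail out of the spin so we can pi-boost it.
		 */
		if (current->prio < orig_owner->prio)
			return 0;

		/*
		 * (2) Our own priority moved: bail out so the slowpath
		 * can requeue our waiter at the new priority.
		 */
		if (waiter->pi.prio != current->prio)
			return 0;
	}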

Thanks for the review, David!

-Greg

> When this
> happens, the owner would be immediately boosted (because we would
> hit the "oncpu" condition, and subsequently follow the voluntary
> preempt path which boosts the owner). Therefore, inversion is
> fully prevented, but we have the extra latency of the
> preempt/boost/wakeup that could have been avoided in the proactive
> model.
>
> However, the design of the algorithm described above constrains the
> probability of this phenomenon occurring to setscheduler()
> operations. Since rt-locks do not support being interrupted by
> signals or timeouts, waiters only depart via the acquisition path.
> And while acquisitions do deboost the owner, the owner also
> changes simultaneously, rendering the deboost moot relative to the
> other waiters.
>
> What this all means is that the downside to this implementation is
> that a high-priority waiter *may* see an extra latency (equivalent
> to roughly two wake-ups) if the owner has its priority reduced via
> setscheduler() while it holds the lock. The penalty is
> deterministic, arguably small enough, and sufficiently rare that I
> do not believe it should be an issue.
>
> Note: If the concept of other exit paths is ever introduced in the
> future, simply adapting the condition to look at owner->normal_prio
> instead of owner->prio should once again constrain the limitation
> to setscheduler().
>
> Special thanks to Peter Morreale for suggesting the optimization to
> only consider skipping the boost if the owner is >= to current.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Peter Morreale <[email protected]>
>
>
>> ---
>>
>> include/linux/rtmutex.h | 1
>> kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
>> kernel/rtmutex_common.h | 1
>> 3 files changed, 153 insertions(+), 44 deletions(-)
>>
>> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
>> index d984244..1d98107 100644
>> --- a/include/linux/rtmutex.h
>> +++ b/include/linux/rtmutex.h
>> @@ -33,6 +33,7 @@ struct rt_mutex {
>> struct pi_node node;
>> struct pi_sink snk;
>> int prio;
>> + int boosters;
>> } pi;
>> #ifdef CONFIG_DEBUG_RT_MUTEXES
>> int save_state;
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index 0f64298..de213ac 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct
>> task_struct *owner,
>> {
>> unsigned long val = (unsigned long)owner | mask;
>>
>> - if (rt_mutex_has_waiters(lock)) {
>> + if (lock->pi.boosters) {
>> struct task_struct *prev_owner = rt_mutex_owner(lock);
>>
>> rtmutex_pi_owner(lock, prev_owner, 0);
>> rtmutex_pi_owner(lock, owner, 1);
>> + }
>>
>> + if (rt_mutex_has_waiters(lock))
>> val |= RT_MUTEX_HAS_WAITERS;
>> - }
>>
>> lock->owner = (struct task_struct *)val;
>> }
>> @@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct
>> pi_sink *snk,
>>
>> spin_lock_irqsave(&lock->wait_lock, iflags);
>>
>> - if (rt_mutex_has_waiters(lock)) {
>> + if (lock->pi.boosters) {
>> owner = rt_mutex_owner(lock);
>>
>> if (owner && owner != RT_RW_READER) {
>> @@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
>> pi_node_init(&lock->pi.node);
>>
>> lock->pi.prio = MAX_PRIO;
>> + lock->pi.boosters = 0;
>> pi_source_init(&lock->pi.src, &lock->pi.prio);
>> lock->pi.snk = rtmutex_pi_snk;
>>
>> @@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct
>> rt_mutex *lock)
>> return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
>> }
>>
>> +static inline void requeue_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> +{
>> + BUG_ON(!waiter->task);
>> +
>> + plist_del(&waiter->list_entry, &lock->wait_list);
>> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> + plist_add(&waiter->list_entry, &lock->wait_list);
>> +}
>> +
>> /*
>> * These callbacks are invoked whenever a waiter has changed priority.
>> * So we should requeue it within the lock->wait_list
>> @@ -343,11 +355,8 @@ static inline int
>> rtmutex_waiter_pi_update(struct pi_sink *snk,
>> * pi list. Therefore, if waiter->pi.prio has changed since we
>> * queued ourselves, requeue it.
>> */
>> - if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
>> - plist_del(&waiter->list_entry, &lock->wait_list);
>> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> - plist_add(&waiter->list_entry, &lock->wait_list);
>> - }
>> + if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
>> + requeue_waiter(lock, waiter);
>>
>> spin_unlock_irqrestore(&lock->wait_lock, iflags);
>>
>> @@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
>> .update = rtmutex_waiter_pi_update,
>> };
>>
>> -/*
>> - * This must be called with lock->wait_lock held.
>> - */
>> -static int add_waiter(struct rt_mutex *lock,
>> - struct rt_mutex_waiter *waiter,
>> - unsigned long *flags)
>> +static void boost_lock(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> {
>> - int has_waiters = rt_mutex_has_waiters(lock);
>> -
>> - waiter->task = current;
>> - waiter->lock = lock;
>> - waiter->pi.prio = current->prio;
>> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> - plist_add(&waiter->list_entry, &lock->wait_list);
>> waiter->pi.snk = rtmutex_waiter_pi_snk;
>>
>> /*
>> @@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
>> * If we previously had no waiters, we are transitioning to
>> * a mode where we need to boost the owner
>> */
>> - if (!has_waiters) {
>> + if (!lock->pi.boosters) {
>> struct task_struct *owner = rt_mutex_owner(lock);
>> rtmutex_pi_owner(lock, owner, 1);
>> }
>>
>> - spin_unlock_irqrestore(&lock->wait_lock, *flags);
>> - task_pi_update(current, 0);
>> - spin_lock_irqsave(&lock->wait_lock, *flags);
>> -
>> - return 0;
>> + lock->pi.boosters++;
>> + waiter->pi.boosted = 1;
>> }
>>
>> -/*
>> - * Remove a waiter from a lock
>> - *
>> - * Must be called with lock->wait_lock held
>> - */
>> -static void remove_waiter(struct rt_mutex *lock,
>> - struct rt_mutex_waiter *waiter)
>> +static void deboost_lock(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter,
>> + struct task_struct *p)
>> {
>> - struct task_struct *p = waiter->task;
>> + BUG_ON(!waiter->pi.boosted);
>>
>> - plist_del(&waiter->list_entry, &lock->wait_list);
>> - waiter->task = NULL;
>> + waiter->pi.boosted = 0;
>> + lock->pi.boosters--;
>>
>> /*
>> * We can stop boosting the owner if there are no more waiters
>> */
>> - if (!rt_mutex_has_waiters(lock)) {
>> + if (!lock->pi.boosters) {
>> struct task_struct *owner = rt_mutex_owner(lock);
>> rtmutex_pi_owner(lock, owner, 0);
>> }
>> @@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
>> }
>>
>> /*
>> + * This must be called with lock->wait_lock held.
>> + */
>> +static void _add_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> +{
>> + waiter->task = current;
>> + waiter->lock = lock;
>> + waiter->pi.prio = current->prio;
>> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> + plist_add(&waiter->list_entry, &lock->wait_list);
>> +}
>> +
>> +static int add_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter,
>> + unsigned long *flags)
>> +{
>> + _add_waiter(lock, waiter);
>> +
>> + boost_lock(lock, waiter);
>> +
>> + spin_unlock_irqrestore(&lock->wait_lock, *flags);
>> + task_pi_update(current, 0);
>> + spin_lock_irqsave(&lock->wait_lock, *flags);
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * Remove a waiter from a lock
>> + *
>> + * Must be called with lock->wait_lock held
>> + */
>> +static void remove_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> +{
>> + struct task_struct *p = waiter->task;
>> +
>> + plist_del(&waiter->list_entry, &lock->wait_list);
>> + waiter->task = NULL;
>> +
>> + if (waiter->pi.boosted)
>> + deboost_lock(lock, waiter, p);
>> +}
>> +
>> +/*
>> * Wake up the next waiter on the lock.
>> *
>> * Remove the top waiter from the current tasks waiter list and from
>> @@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter
>> *waiter,
>> if (orig_owner != rt_mutex_owner(waiter->lock))
>> return 0;
>>
>> + /* Special handling for when we are not in pi-boost mode */
>> + if (!waiter->pi.boosted) {
>> + /*
>> + * Are we higher priority than the owner? If so
>> + * we should bail out immediately so that we can
>> + * pi boost them.
>> + */
>> + if (current->prio < orig_owner->prio)
>> + return 0;
>> +
>> + /*
>> + * Did our priority change? If so, we need to
>> + * requeue our position in the list
>> + */
>> + if (waiter->pi.prio != current->prio)
>> + return 0;
>> + }
>> +
>> /* Owner went to bed, so should we */
>> if (!task_is_current(orig_owner))
>> return 1;
>> @@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> unsigned long saved_state, state, flags;
>> struct task_struct *orig_owner;
>> int missed = 0;
>> + int boosted = 0;
>>
>> init_waiter(&waiter);
>>
>> @@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> }
>> missed = 1;
>>
>> + orig_owner = rt_mutex_owner(lock);
>> +
>> /*
>> * waiter.task is NULL the first time we come here and
>> * when we have been woken up by the previous owner
>> * but the lock got stolen by an higher prio task.
>> */
>> - if (!waiter.task) {
>> - add_waiter(lock, &waiter, &flags);
>> + if (!waiter.task)
>> + _add_waiter(lock, &waiter);
>> +
>> + /*
>> + * We only need to pi-boost the owner if they are lower
>> + * priority than us. We dont care if this is racy
>> + * against priority changes as we will break out of
>> + * the adaptive spin anytime any priority changes occur
>> + * without boosting enabled.
>> + */
>> + if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
>> + boost_lock(lock, &waiter);
>> + boosted = 1;
>> +
>> + spin_unlock_irqrestore(&lock->wait_lock, flags);
>> + task_pi_update(current, 0);
>> + spin_lock_irqsave(&lock->wait_lock, flags);
>> +
>> /* Wakeup during boost ? */
>> if (unlikely(!waiter.task))
>> continue;
>> }
>>
>> /*
>> + * If we are not currently pi-boosting the lock, we have to
>> + * monitor whether our priority changed since the last
>> + * time it was recorded and requeue ourselves if it moves.
>> + */
>> + if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
>> + waiter.pi.prio = current->prio;
>> +
>> + requeue_waiter(lock, &waiter);
>> + }
>> +
>> + /*
>> * Prevent schedule() to drop BKL, while waiting for
>> * the lock ! We restore lock_depth when we come back.
>> */
>> saved_flags = current->flags & PF_NOSCHED;
>> current->lock_depth = -1;
>> current->flags &= ~PF_NOSCHED;
>> - orig_owner = rt_mutex_owner(lock);
>> get_task_struct(orig_owner);
>> spin_unlock_irqrestore(&lock->wait_lock, flags);
>>
>> @@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> * barrier which we rely upon to ensure current->state
>> * is visible before we test waiter.task.
>> */
>> + if (waiter.task && !waiter.pi.boosted) {
>> + spin_lock_irqsave(&lock->wait_lock, flags);
>> +
>> + /*
>> + * We get here if we have not yet boosted
>> + * the lock, yet we are going to sleep. If
>> + * we are still pending (waiter.task != 0),
>> + * then go ahead and boost them now
>> + */
>> + if (waiter.task) {
>> + boost_lock(lock, &waiter);
>> + boosted = 1;
>> + }
>> +
>> + spin_unlock_irqrestore(&lock->wait_lock, flags);
>> + task_pi_update(current, 0);
>> + }
>> +
>> if (waiter.task)
>> schedule_rt_mutex(lock);
>> } else
>> @@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> spin_unlock_irqrestore(&lock->wait_lock, flags);
>>
>> /* Undo any pi boosting, if necessary */
>> - task_pi_update(current, 0);
>> + if (boosted)
>> + task_pi_update(current, 0);
>>
>> debug_rt_mutex_free_waiter(&waiter);
>> }
>> @@ -708,6 +810,7 @@ static void noinline __sched
>> rt_spin_lock_slowunlock(struct rt_mutex *lock)
>> {
>> unsigned long flags;
>> + int deboost = 0;
>>
>> spin_lock_irqsave(&lock->wait_lock, flags);
>>
>> @@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
>> return;
>> }
>>
>> + if (lock->pi.boosters)
>> + deboost = 1;
>> +
>> wakeup_next_waiter(lock, 1);
>>
>> spin_unlock_irqrestore(&lock->wait_lock, flags);
>>
>> - /* Undo pi boosting when necessary */
>> - task_pi_update(current, 0);
>> + if (deboost)
>> + /* Undo pi boosting when necessary */
>> + task_pi_update(current, 0);
>> }
>>
>> void __lockfunc rt_spin_lock(spinlock_t *lock)
>> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
>> index 7bf32d0..34e2381 100644
>> --- a/kernel/rtmutex_common.h
>> +++ b/kernel/rtmutex_common.h
>> @@ -55,6 +55,7 @@ struct rt_mutex_waiter {
>> struct {
>> struct pi_sink snk;
>> int prio;
>> + int boosted;
>> } pi;
>> #ifdef CONFIG_DEBUG_RT_MUTEXES
>> unsigned long ip;
>>
>>
>
>




2008-08-15 12:19:11

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 4/8] rtmutex: formally initialize the rt_mutex_waiters

We will be adding more logic to rt_mutex_waiters, so let's centralize
the initialization now to make this easier going forward.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
}
#endif

+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+ memset(waiter, 0, sizeof(*waiter));
+
+ debug_rt_mutex_init_waiter(waiter);
+ waiter->task = NULL;
+ waiter->write_lock = 0;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct task_struct *orig_owner;
int missed = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long saved_state = -1, state, flags;

+ init_waiter(&waiter);
+
spin_lock_irqsave(&mutex->wait_lock, flags);
init_rw_lists(rwm);

@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)

/* Owner is a writer (or a blocked writer). Block on the lock */

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
-
if (mtx) {
/*
* We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long flags, saved_state = -1, state;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
+ init_waiter(&waiter);

/* we do PI different for writers that are blocked */
waiter.write_lock = 1;
@@ -2270,9 +2274,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
unsigned long flags;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

2008-08-15 12:19:31

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 1/8] add generalized priority-inheritance interface

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 +++++
include/linux/pi.h | 278 +++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 516 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 855 insertions(+), 1 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concept of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boost and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for properly updating a chain of sources and sinks in the proper order. If we passed the priority on the stack, the order in which the system executes could allow the actual value that is set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly-destructable (meaning they do not require an explicit “free()” operation when they are not used anymore). This is important considering their intended use (spinlock_t's which are also implicitly-destructable). As such, any allocations needed for operation must come from internal structure storage as there will be no opportunity to free it later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason we added the ability to attach more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain; these peripheral sinks are informally called “leaf sinks”.
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+ --------- ---------
+ | leaf | | leaf |
+ --------- ---------
+ / /
+ --------- / ---------- / --------- ---------
+ ->-| node |->---| node |-->---| node |->---| leaf |
+ --------- ---------- --------- ---------
+
+The reason this was done was to unify the notion of a “sink” into a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. “I'm a task”, “I'm a lock”, “I'm a node”, “I'm a work-item”, “I'm a wait-queue”, etc.).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-source relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next; but long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
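
For illustration only (not part of the patch), here is a minimal sketch of
the source-as-reference idea described in the documentation above; my_node,
my_prio and my_src are hypothetical names:

	#include <linux/pi.h>

	static struct pi_node my_node;		/* hypothetical node */
	static int my_prio = 42;		/* the referenced priority value */
	static struct pi_source my_src;

	static void example_boost(void)
	{
		pi_node_init(&my_node);

		pi_source_init(&my_src, &my_prio);	/* src points at my_prio */
		pi_boost(&my_node, &my_src, 0);		/* node now sees prio 42 */

		my_prio = 17;				/* referenced value changes... */
		pi_boost(&my_node, &my_src, 0);		/* ...refresh by boosting again */

		pi_deboost(&my_node, &my_src, 0);	/* stop contributing */
	}
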
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..d839d4f
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,278 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+#define PI_FLAG_NO_DROPREF (1 << 2)
+
+struct pi_sink {
+ atomic_t refs;
+ int (*boost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *snk,
+ unsigned int flags);
+ int (*free)(struct pi_sink *snk,
+ unsigned int flags);
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *snk;
+ struct pi_source src;
+ atomic_t refs;
+ int prio;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+ int count;
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink snk;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head snks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @snk: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an inital boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @snk: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->boost)
+ return snk->boost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to deboost from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->deboost)
+ return snk->deboost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->update)
+ return snk->update(snk, flags);
+
+ return 0;
+}
+
+/**
+ * pi_sink_dropref - down the reference count, freeing the sink if 0
+ * @snk: the sink context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_sink_dropref(struct pi_sink *snk, unsigned int flags)
+{
+ if (atomic_dec_and_test(&snk->refs)) {
+ if (snk->free)
+ snk->free(snk, flags);
+ }
+}
+
+
+/**
+ * pi_addref - up the reference count
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_addref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ atomic_inc(&snk->refs);
+}
+
+/**
+ * pi_dropref - down the reference count, freeing the node if 0
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_dropref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ pi_sink_dropref(snk, flags);
+}
+
+#endif /* _LINUX_PI_H */
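
As a rough usage sketch of the interface above (illustrative only, not
part of the patch; my_leaf/my_node are hypothetical names, refcounting and
error handling are elided, and a real sink would likely also provide
.deboost and .free callbacks):

	#include <linux/pi.h>

	/* A hypothetical leaf sink that simply reacts to priority changes */
	static int my_leaf_boost(struct pi_sink *snk, struct pi_source *src,
				 unsigned int flags)
	{
		/* record the new minimum priority *src->prio for later use */
		return 0;
	}

	static int my_leaf_update(struct pi_sink *snk, unsigned int flags)
	{
		/* act on the priority recorded by ->boost() */
		return 0;
	}

	static struct pi_sink my_leaf = {
		.refs   = ATOMIC_INIT(1),
		.boost  = my_leaf_boost,
		.update = my_leaf_update,
	};

	static struct pi_node my_node;
	static struct pi_node my_node2;

	static void example_wire_up(void)
	{
		pi_node_init(&my_node);
		pi_node_init(&my_node2);

		/* my_leaf is now notified whenever my_node changes priority */
		pi_add_sink(&my_node, &my_leaf, 0);

		/* chaining: a node is itself a sink, so it can sit downstream */
		pi_add_sink(&my_node, &my_node2.snk, 0);
	}
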
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..46736e4
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,516 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+ pool->count = 0;
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ BUG_ON(!pool->count);
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+ pool->count--;
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_addref(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->snk->refs);
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_dropref_local(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ spin_unlock(&node->lock);
+ }
+}
+
+static inline void
+_pi_sink_dropref_all(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+
+ _pi_sink_dropref_local(node, sinkref);
+ pi_sink_dropref(snk, 0);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *snk)
+{
+ return container_of(snk, struct pi_node, snk);
+}
+
+static inline void
+__pi_update_prio(struct pi_node *node)
+{
+ if (!plist_head_empty(&node->srcs))
+ node->prio = plist_first(&node->srcs)->prio;
+ else
+ node->prio = MAX_PRIO;
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+
+ __pi_update_prio(node);
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+
+ __pi_update_prio(node);
+}
+
+/*
+ * _pi_node_update - update the chain
+ *
+ * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
+ * that need to propagate up the chain. This is a step-wise process where we
+ * have to be careful about locking and preemption. By trying MAX_PI_DEPs
+ * times, we guarantee that this update routine is an effective barrier...
+ * all modifications made prior to the call to this barrier will have completed.
+ *
+ * Deadlock avoidance: This node may participate in a chain of nodes which
+ * form a graph of arbitrary structure. While the graph should technically
+ * never close on itself barring any bugs, we still want to protect against
+ * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
+ * from detecting this potential). To do this, we employ a dual-locking
+ * scheme where we can carefully control the order. That is: node->lock
+ * protects most of the node's internal state, but it will never be held
+ * across a chain update. sinkref->lock, on the other hand, can be held
+ * across a boost/deboost, and also guarantees proper execution order. Also
+ * note that no locks are held across an snk->update.
+ */
+static int
+_pi_node_update(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ struct updater {
+ int update;
+ struct pi_sinkref *sinkref;
+ struct pi_sink *snk;
+ } updaters[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ list_for_each_entry(sinkref, &node->snks, list) {
+ /*
+ * If the priority is changing, or if this is a
+ * BOOST/DEBOOST, we consider this sink "stale"
+ */
+ if (sinkref->prio != node->prio
+ || sinkref->state != pi_state_boosted) {
+ struct updater *iter = &updaters[count++];
+
+ BUG_ON(!atomic_read(&sinkref->snk->refs));
+ _pi_sink_addref(sinkref);
+ sinkref->prio = node->prio;
+
+ iter->update = 1;
+ iter->sinkref = sinkref;
+ iter->snk = sinkref->snk;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ unsigned int lflags = PI_FLAG_DEFER_UPDATE;
+ struct pi_sink *snk;
+
+ sinkref = iter->sinkref;
+ snk = iter->snk;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ snk->boost(snk, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ snk->deboost(snk, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * the above.
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ break;
+ case pi_state_free:
+ iter->update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ /*
+ * We will drop the sinkref reference while still holding the
+ * preempt/irqs off so that the memory is returned synchronously
+ * to the system.
+ */
+ _pi_sink_dropref_local(node, sinkref);
+
+ /*
+ * The sinkref is no longer valid since we dropped the reference
+ * above, so symbolically drop it here too to make it more
+ * obvious if we try to use it later
+ */
+ iter->sinkref = NULL;
+ }
+
+ local_irq_restore(iflags);
+
+ /*
+ * Note: At this point, sinkref is invalid since we dropref'd
+ * it above, but snk is valid since we still hold the remote
+ * reference. This is key to the design because it allows us
+ * to synchronously free the sinkref object, yet maintain a
+ * reference to the sink across the update
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ if (iter->update)
+ iter->snk->update(iter->snk, 0);
+ }
+
+ /*
+ * We perform all the free operations together at the end, using
+ * only automatic/stack variables since any one of these operations
+ * could result in our node object being deallocated
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ pi_sink_dropref(iter->snk, 0);
+ }
+
+ return 0;
+}
+
+static void
+_pi_del_sinkref(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+ int remove = 0;
+ unsigned long iflags;
+
+ local_irq_save(iflags);
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (snk->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * when the caller performed the lookup
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ _pi_sink_dropref_local(node, sinkref);
+ local_irq_restore(iflags);
+
+ pi_sink_dropref(snk, 0);
+}
+
+static int
+_pi_node_boost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_free(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * When the node is freed, we should perform an implicit
+ * del_sink on any remaining sinks we may have
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+static struct pi_sink pi_node_snk = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+ .free = _pi_node_free,
+};
+
+void pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ node->snk = pi_node_snk;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->snks);
+ plist_head_init(&node->srcs, &node->lock);
+ atomic_set(&node->snk.refs, 1);
+}
+
+int pi_add_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!atomic_read(&node->snk.refs)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ sinkref->prio = node->prio;
+ pi_source_init(&sinkref->src, &sinkref->prio);
+ sinkref->snk = snk;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_inc(&sinkref->snk->refs);
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->snks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int pi_del_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * There may be multiple matches to snk because sometimes a
+ * deboost/free may still be pending an update when the same
+ * node has been added. So we want to process all instances
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ if (sinkref->snk == snk) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+
+

2008-08-15 12:20:04

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 8/8] rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance. An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart. However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, while remaining fully preemptible.

Instrumentation shows that the majority of acquisitions under most
workloads falls either into the fastpath category or the adaptive
spin category within the slowpath. The necessity to pi-boost a
lock-owner should be sufficiently rare, yet the slow path
blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
each spin of the lock (if we are not boosted already). If the
owner's effective priority is logically less than the waiter's
priority, we must boost them.

2) We check our own priority against our current queue
position on the waiters-list (if we are not boosted already).
If our priority was changed, we need to re-queue ourselves to
update our position.

3) We break out of the adaptive-spin if either of the above
conditions (1) or (2) changes, so that we can re-evaluate the
lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
voluntarily preempt since we are losing our ability to
dynamically process the conditions above.

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a pro-active protocol
(boost on entry) versus a reactive protocol (boost when
necessary). The upside to the latter is that we don't take a
penalty for pi when it is not necessary (which is most of the time).
The downside is that we technically leave the owner exposed to
getting preempted (should it get asynchronously deprioritized), even
if our waiter is the highest priority task in the system. When this
happens, the owner would be immediately boosted (because we would
hit the "oncpu" condition, and subsequently follow the voluntary
preempt path which boosts the owner). Therefore, inversion is
correctly prevented, but we have the extra latency of the
preempt/boost/wakeup that could have been avoided in the proactive
model.

However, the design of the algorithm described above constrains the
probability of this phenomenon occurring to setscheduler()
operations. Since rt-locks do not support being interrupted by
signals or timeouts, waiters only depart via the acquisition path.
And while acquisitions do deboost the owner, the owner also
changes simultaneously, rendering the deboost moot relative to the
other waiters.

What this all means is that the downside to this implementation is
that a high-priority waiter *may* see an extra latency (equivalent
to roughly two wake-ups) if the owner has its priority reduced via
setscheduler() while it holds the lock. The penalty is
deterministic, arguably small enough, and sufficiently rare that I
do not believe it should be an issue.

Note: If other exit paths are ever introduced in the future,
simply adapting the condition to look at owner->normal_prio
instead of owner->prio should once again constrain the limitation
to setscheduler().
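
For illustration, here is a hedged, standalone sketch of the two
variants of that condition. The task_model type and the helper name
are made up for this example; in the real patch the comparison is
made against task_struct fields in the adaptive-spin path.

#include <stdbool.h>

struct task_model {
	int prio;		/* effective priority (pi boosting applied) */
	int normal_prio;	/* base priority, never boosted */
};

/* Should a spinning waiter break out to pi-boost the owner?
 * 'other_exit_paths' models the hypothetical future case above. */
static bool should_break_to_boost(const struct task_model *waiter,
				  const struct task_model *owner,
				  bool other_exit_paths)
{
	int owner_prio = other_exit_paths ? owner->normal_prio
					  : owner->prio;

	return waiter->prio < owner_prio;	/* lower value == higher prio */
}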

Special thanks to Peter Morreale for suggesting the optimization
of only considering skipping the boost when the owner's priority
is >= current's.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>
---

include/linux/rtmutex.h | 1
kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
kernel/rtmutex_common.h | 1
3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d984244..1d98107 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
struct pi_node node;
struct pi_sink snk;
int prio;
+ int boosters;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0f64298..de213ac 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
struct task_struct *prev_owner = rt_mutex_owner(lock);

rtmutex_pi_owner(lock, prev_owner, 0);
rtmutex_pi_owner(lock, owner, 1);
+ }

+ if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;
- }

lock->owner = (struct task_struct *)val;
}
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,

spin_lock_irqsave(&lock->wait_lock, iflags);

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
owner = rt_mutex_owner(lock);

if (owner && owner != RT_RW_READER) {
@@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
pi_node_init(&lock->pi.node);

lock->pi.prio = MAX_PRIO;
+ lock->pi.boosters = 0;
pi_source_init(&lock->pi.src, &lock->pi.prio);
lock->pi.snk = rtmutex_pi_snk;

@@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
}

+static inline void requeue_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ BUG_ON(!waiter->task);
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
/*
* These callbacks are invoked whenever a waiter has changed priority.
* So we should requeue it within the lock->wait_list
@@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
* pi list. Therefore, if waiter->pi.prio has changed since we
* queued ourselves, requeue it.
*/
- if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
- plist_del(&waiter->list_entry, &lock->wait_list);
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
- }
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
+ requeue_waiter(lock, waiter);

spin_unlock_irqrestore(&lock->wait_lock, iflags);

@@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
.update = rtmutex_waiter_pi_update,
};

-/*
- * This must be called with lock->wait_lock held.
- */
-static int add_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long *flags)
+static void boost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
{
- int has_waiters = rt_mutex_has_waiters(lock);
-
- waiter->task = current;
- waiter->lock = lock;
- waiter->pi.prio = current->prio;
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
@@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
* If we previously had no waiters, we are transitioning to
* a mode where we need to boost the owner
*/
- if (!has_waiters) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 1);
}

- spin_unlock_irqrestore(&lock->wait_lock, *flags);
- task_pi_update(current, 0);
- spin_lock_irqsave(&lock->wait_lock, *flags);
-
- return 0;
+ lock->pi.boosters++;
+ waiter->pi.boosted = 1;
}

-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter)
+static void deboost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *p)
{
- struct task_struct *p = waiter->task;
+ BUG_ON(!waiter->pi.boosted);

- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
+ waiter->pi.boosted = 0;
+ lock->pi.boosters--;

/*
* We can stop boosting the owner if there are no more waiters
*/
- if (!rt_mutex_has_waiters(lock)) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 0);
}
@@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
}

/*
+ * This must be called with lock->wait_lock held.
+ */
+static void _add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ _add_waiter(lock, waiter);
+
+ boost_lock(lock, waiter);
+
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ if (waiter->pi.boosted)
+ deboost_lock(lock, waiter, p);
+}
+
+/*
* Wake up the next waiter on the lock.
*
* Remove the top waiter from the current tasks waiter list and from
@@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
if (orig_owner != rt_mutex_owner(waiter->lock))
return 0;

+ /* Special handling for when we are not in pi-boost mode */
+ if (!waiter->pi.boosted) {
+ /*
+ * Are we higher priority than the owner? If so
+ * we should bail out immediately so that we can
+ * pi boost them.
+ */
+ if (current->prio < orig_owner->prio)
+ return 0;
+
+ /*
+ * Did our priority change? If so, we need to
+ * requeue our position in the list
+ */
+ if (waiter->pi.prio != current->prio)
+ return 0;
+ }
+
/* Owner went to bed, so should we */
if (!task_is_current(orig_owner))
return 1;
@@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unsigned long saved_state, state, flags;
struct task_struct *orig_owner;
int missed = 0;
+ int boosted = 0;

init_waiter(&waiter);

@@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
}
missed = 1;

+ orig_owner = rt_mutex_owner(lock);
+
/*
* waiter.task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by an higher prio task.
*/
- if (!waiter.task) {
- add_waiter(lock, &waiter, &flags);
+ if (!waiter.task)
+ _add_waiter(lock, &waiter);
+
+ /*
+ * We only need to pi-boost the owner if they are lower
+ * priority than us. We dont care if this is racy
+ * against priority changes as we will break out of
+ * the adaptive spin anytime any priority changes occur
+ * without boosting enabled.
+ */
+ if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
}

/*
+ * If we are not currently pi-boosting the lock, we have to
+ * monitor whether our priority changed since the last
+ * time it was recorded and requeue ourselves if it moves.
+ */
+ if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
+ waiter.pi.prio = current->prio;
+
+ requeue_waiter(lock, &waiter);
+ }
+
+ /*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
*/
saved_flags = current->flags & PF_NOSCHED;
current->lock_depth = -1;
current->flags &= ~PF_NOSCHED;
- orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
spin_unlock_irqrestore(&lock->wait_lock, flags);

@@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* barrier which we rely upon to ensure current->state
* is visible before we test waiter.task.
*/
+ if (waiter.task && !waiter.pi.boosted) {
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
+ /*
+ * We get here if we have not yet boosted
+ * the lock, yet we are going to sleep. If
+ * we are still pending (waiter.task != 0),
+ * then go ahead and boost them now
+ */
+ if (waiter.task) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+ }
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ }
+
if (waiter.task)
schedule_rt_mutex(lock);
} else
@@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_unlock_irqrestore(&lock->wait_lock, flags);

/* Undo any pi boosting, if necessary */
- task_pi_update(current, 0);
+ if (boosted)
+ task_pi_update(current, 0);

debug_rt_mutex_free_waiter(&waiter);
}
@@ -708,6 +810,7 @@ static void noinline __sched
rt_spin_lock_slowunlock(struct rt_mutex *lock)
{
unsigned long flags;
+ int deboost = 0;

spin_lock_irqsave(&lock->wait_lock, flags);

@@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
return;
}

+ if (lock->pi.boosters)
+ deboost = 1;
+
wakeup_next_waiter(lock, 1);

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting when necessary */
- task_pi_update(current, 0);
+ if (deboost)
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 7bf32d0..34e2381 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -55,6 +55,7 @@ struct rt_mutex_waiter {
struct {
struct pi_sink snk;
int prio;
+ int boosted;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;

2008-08-15 12:19:48

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 7/8] rtmutex: convert rtmutexes to fully use the PI library

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code. This patch converts the rtmutex PI code to officially
use the PI library.
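
As a rough illustration of the conversion pattern (not the patch
itself): the lock now participates in the PI graph through a pi_sink
whose .boost() callback merely records the new priority, and whose
.update() callback later applies it to the current owner under
lock->wait_lock. The sketch below models that record/apply split
with invented names (lock_model, model_boost, model_update); the real
callbacks are rtmutex_pi_boost() and rtmutex_pi_update() in the diff.

struct lock_model {
	int prio;		/* cached top-waiter priority */
	int owner_prio;		/* stand-in for the owner task's priority */
};

/* .boost(): record the new priority.  No locks are needed because
 * the node interlock already serializes these calls. */
static int model_boost(struct lock_model *lock, int new_prio)
{
	lock->prio = new_prio;
	return 0;
}

/* .update(): apply the cached priority to whoever owns the lock.
 * In the real code this runs under lock->wait_lock and forwards the
 * value to the owner via task_pi_boost()/task_pi_update(). */
static int model_update(struct lock_model *lock)
{
	if (lock->prio < lock->owner_prio)	/* lower value == higher prio */
		lock->owner_prio = lock->prio;
	return 0;
}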

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 15 -
include/linux/sched.h | 21 -
kernel/fork.c | 2
kernel/rcupreempt-boost.c | 2
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 944 ++++++++++++++-------------------------------
kernel/rtmutex_common.h | 18 -
kernel/rwlock_torture.c | 32 --
kernel/sched.c | 12 -
11 files changed, 321 insertions(+), 735 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..d0ef0f1 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
#include <asm/atomic.h>
#include <linux/spinlock_types.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>

#ifdef CONFIG_PREEMPT_RT
/*
@@ -67,6 +68,7 @@ struct rw_mutex {
atomic_t count; /* number of times held for read */
atomic_t owners; /* number of owners as readers */
struct list_head readers;
+ struct pi_sink pi_snk;
int prio;
};

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..d984244 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/plist.h>
#include <linux/spinlock_types.h>
+#include <linux/pi.h>

/**
* The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
raw_spinlock_t wait_lock;
struct plist_head wait_list;
struct task_struct *owner;
+ struct {
+ struct pi_source src;
+ struct pi_node node;
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9132b42..d59c804 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1106,6 +1106,7 @@ struct reader_lock_struct {
struct rw_mutex *lock;
struct list_head list;
struct task_struct *task;
+ struct pi_source pi_src;
int count;
};

@@ -1309,15 +1310,6 @@ struct task_struct {

} pi;

-#ifdef CONFIG_RT_MUTEXES
- /* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
- /* Deadlock detection and priority inheritance handling */
- struct rt_mutex_waiter *pi_blocked_on;
- int rtmutex_prio;
- struct pi_source rtmutex_prio_src;
-#endif
-
#ifdef CONFIG_DEBUG_MUTEXES
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
@@ -1806,17 +1798,6 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-#ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_adjust_pi(struct task_struct *p);
-#else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
- return p->normal_prio;
-}
-# define rt_mutex_adjust_pi(p) do { } while (0)
-#endif
-
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0a9..759c6de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -885,8 +885,6 @@ static void rt_mutex_init_task(struct task_struct *p)
{
spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters, &p->pi_lock);
- p->pi_blocked_on = NULL;
# ifdef CONFIG_DEBUG_RT_MUTEXES
p->last_kernel_lock = NULL;
# endif
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index e8d9d76..85b3c2b 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -424,7 +424,7 @@ void rcu_boost_readers(void)

spin_lock_irqsave(&rcu_boost_wake_lock, flags);

- prio = rt_mutex_getprio(curr);
+ prio = get_rcu_prio(curr);

rcu_trace_boost_try_boost_readers(RCU_BOOST_ME);

diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 0d9cb54..2034ce1 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -57,8 +57,6 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
- DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
#ifdef CONFIG_PREEMPT_RT
WARN_ON(task->reader_lock_count);
#endif
@@ -156,7 +154,6 @@ void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
waiter->deadlock_task_pid = NULL;
}

@@ -164,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
DEBUG_LOCKS_WARN_ON(waiter->task);
memset(waiter, 0x22, sizeof(*waiter));
}
diff --git a/kernel/rtmutex-tester.c b/kernel/rtmutex-tester.c
index 092e4c6..dff8781 100644
--- a/kernel/rtmutex-tester.c
+++ b/kernel/rtmutex-tester.c
@@ -373,11 +373,11 @@ static ssize_t sysfs_test_status(struct sys_device *dev, char *buf)
spin_lock(&rttest_lock);

curr += sprintf(curr,
- "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, B: %p, K: %d, M:",
+ "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, K: %d M:",
td->opcode, td->event, tsk->state,
(MAX_RT_PRIO - 1) - tsk->prio,
(MAX_RT_PRIO - 1) - tsk->normal_prio,
- tsk->pi_blocked_on, td->bkl);
+ td->bkl);

for (i = MAX_RT_TEST_MUTEXES - 1; i >=0 ; i--)
curr += sprintf(curr, "%d", td->mutexes[i]);
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 62fdc3d..0f64298 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -58,14 +58,32 @@
* state.
*/

+static inline void
+rtmutex_pi_owner(struct rt_mutex *lock, struct task_struct *p, int add)
+{
+ if (!p || p == RT_RW_READER)
+ return;
+
+ if (add)
+ task_pi_boost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+ else
+ task_pi_deboost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+}
+
static void
rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
unsigned long mask)
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock))
+ if (rt_mutex_has_waiters(lock)) {
+ struct task_struct *prev_owner = rt_mutex_owner(lock);
+
+ rtmutex_pi_owner(lock, prev_owner, 0);
+ rtmutex_pi_owner(lock, owner, 1);
+
val |= RT_MUTEX_HAS_WAITERS;
+ }

lock->owner = (struct task_struct *)val;
}
@@ -134,245 +152,88 @@ static inline int task_is_reader(struct task_struct *task) { return 0; }
#endif

int pi_initialized;
-
-/*
- * we initialize the wait_list runtime. (Could be done build-time and/or
- * boot-time.)
- */
-static inline void init_lists(struct rt_mutex *lock)
+static inline int rtmutex_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- if (unlikely(!lock->wait_list.prio_list.prev)) {
- plist_head_init(&lock->wait_list, &lock->wait_lock);
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- pi_initialized++;
-#endif
- }
-}
-
-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio);
-
-/*
- * Calculate task priority from the waiter list priority
- *
- * Return task->normal_prio when the waiter list is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
- int prio = min(task->normal_prio, get_rcu_prio(task));
-
- prio = rt_mutex_get_readers_prio(task, prio);
-
- if (likely(!task_has_pi_waiters(task)))
- return prio;
-
- return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
-}
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);

-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
- int prio = rt_mutex_getprio(task);
-
- if (task->rtmutex_prio != prio) {
- task->rtmutex_prio = prio;
- task_pi_boost(task, &task->rtmutex_prio_src, 0);
- }
-}
-
-/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-static void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
+ /*
+ * We dont need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ lock->pi.prio = *src->prio;

- spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
+ return 0;
}

-/*
- * Max number of times we'll walk the boosting chain:
- */
-int max_lock_depth = 1024;
-
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth);
-/*
- * Adjust the priority chain. Also used for deadlock detection.
- * Decreases task's usage by one - may thus free the task.
- * Returns 0 or -EDEADLK.
- */
-static int rt_mutex_adjust_prio_chain(struct task_struct *task,
- int deadlock_detect,
- struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- int recursion_depth)
+static inline int rtmutex_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
- struct rt_mutex *lock;
- struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
- int detect_deadlock, ret = 0, depth = 0;
- unsigned long flags;
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);
+ struct task_struct *owner = NULL;
+ unsigned long iflags;

- detect_deadlock = debug_rt_mutex_detect_deadlock(orig_waiter,
- deadlock_detect);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- /*
- * The (de)boosting is a step by step approach with a lot of
- * pitfalls. We want this to be preemptible and we want hold a
- * maximum of two locks per step. So we have to check
- * carefully whether things change under us.
- */
- again:
- if (++depth > max_lock_depth) {
- static int prev_max;
+ if (rt_mutex_has_waiters(lock)) {
+ owner = rt_mutex_owner(lock);

- /*
- * Print this only once. If the admin changes the limit,
- * print a new message when reaching the limit again.
- */
- if (prev_max != max_lock_depth) {
- prev_max = max_lock_depth;
- printk(KERN_WARNING "Maximum lock depth %d reached "
- "task: %s (%d)\n", max_lock_depth,
- top_task->comm, task_pid_nr(top_task));
+ if (owner && owner != RT_RW_READER) {
+ rtmutex_pi_owner(lock, owner, 1);
+ get_task_struct(owner);
}
- put_task_struct(task);
-
- return deadlock_detect ? -EDEADLK : 0;
}
- retry:
- /*
- * Task can not go away as we did a get_task() before !
- */
- spin_lock_irqsave(&task->pi_lock, flags);

- waiter = task->pi_blocked_on;
- /*
- * Check whether the end of the boosting chain has been
- * reached or the state of the chain has changed while we
- * dropped the locks.
- */
- if (!waiter || !waiter->task)
- goto out_unlock_pi;
-
- /*
- * Check the orig_waiter state. After we dropped the locks,
- * the previous owner of the lock might have released the lock
- * and made us the pending owner:
- */
- if (orig_waiter && !orig_waiter->task)
- goto out_unlock_pi;
-
- /*
- * Drop out, when the task has no waiters. Note,
- * top_waiter can be NULL, when we are in the deboosting
- * mode!
- */
- if (top_waiter && (!task_has_pi_waiters(task) ||
- top_waiter != task_top_pi_waiter(task)))
- goto out_unlock_pi;
-
- /*
- * When deadlock detection is off then we check, if further
- * priority adjustment is necessary.
- */
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
- goto out_unlock_pi;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);

- lock = waiter->lock;
- if (!spin_trylock(&lock->wait_lock)) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- cpu_relax();
- goto retry;
+ if (owner && owner != RT_RW_READER) {
+ task_pi_update(owner, 0);
+ put_task_struct(owner);
}

- /* Deadlock detection */
- if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- debug_rt_mutex_deadlock(deadlock_detect, orig_waiter, lock);
- spin_unlock(&lock->wait_lock);
- ret = deadlock_detect ? -EDEADLK : 0;
- goto out_unlock_pi;
- }
+ return 0;
+}

- top_waiter = rt_mutex_top_waiter(lock);
+static struct pi_sink rtmutex_pi_snk = {
+ .boost = rtmutex_pi_boost,
+ .update = rtmutex_pi_update,
+};

- /* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
-
- /* Release the task */
- spin_unlock(&task->pi_lock);
- put_task_struct(task);
+static void init_pi(struct rt_mutex *lock)
+{
+ pi_node_init(&lock->pi.node);

- /* Grab the next task */
- task = rt_mutex_owner(lock);
+ lock->pi.prio = MAX_PRIO;
+ pi_source_init(&lock->pi.src, &lock->pi.prio);
+ lock->pi.snk = rtmutex_pi_snk;

- /*
- * Readers are special. We may need to boost more than one owner.
- */
- if (task_is_reader(task)) {
- ret = rt_mutex_adjust_readers(orig_lock, orig_waiter,
- top_task, lock,
- recursion_depth);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- goto out;
- }
+ pi_add_sink(&lock->pi.node, &lock->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}

- get_task_struct(task);
- spin_lock(&task->pi_lock);
-
- if (waiter == rt_mutex_top_waiter(lock)) {
- /* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
-
- } else if (top_waiter == waiter) {
- /* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
- waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
+/*
+ * we initialize the wait_list runtime. (Could be done build-time and/or
+ * boot-time.)
+ */
+static inline void init_lists(struct rt_mutex *lock)
+{
+ if (unlikely(!lock->wait_list.prio_list.prev)) {
+ plist_head_init(&lock->wait_list, &lock->wait_lock);
+ init_pi(lock);
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ pi_initialized++;
+#endif
}
-
- spin_unlock(&task->pi_lock);
-
- top_waiter = rt_mutex_top_waiter(lock);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- if (!detect_deadlock && waiter != top_waiter)
- goto out_put_task;
-
- goto again;
-
- out_unlock_pi:
- spin_unlock_irqrestore(&task->pi_lock, flags);
- out_put_task:
- put_task_struct(task);
- out:
- return ret;
}

/*
+ * Max number of times we'll walk the boosting chain:
+ */
+int max_lock_depth = 1024;
+
+/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
@@ -380,7 +241,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *next;

if (!rt_mutex_owner_pending(lock))
return 0;
@@ -390,49 +250,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)

WARN_ON(task_is_reader(rt_mutex_owner(lock)));

- spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, mode)) {
- spin_unlock(&pendowner->pi_lock);
- return 0;
- }
-
- /*
- * Check if a waiter is enqueued on the pending owners
- * pi_waiters list. Remove it and readjust pending owners
- * priority.
- */
- if (likely(!rt_mutex_has_waiters(lock))) {
- spin_unlock(&pendowner->pi_lock);
- return 1;
- }
-
- /* No chain handling, pending owner is not blocked on anything: */
- next = rt_mutex_top_waiter(lock);
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
- __rt_mutex_adjust_prio(pendowner);
- spin_unlock(&pendowner->pi_lock);
-
- /*
- * We are going to steal the lock and a waiter was
- * enqueued on the pending owners pi_waiters queue. So
- * we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
- * lock and gets unboosted because the booster is
- * interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
- *
- * Note: in the rare case of a SCHED_OTHER task changing
- * its priority and thus stealing the lock, next->task
- * might be current:
- */
- if (likely(next->task != current)) {
- spin_lock(&current->pi_lock);
- plist_add(&next->pi_list_entry, &current->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock(&current->pi_lock);
- }
- return 1;
+ return lock_is_stealable(pendowner, mode);
}

/*
@@ -486,74 +304,145 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
}

/*
- * Task blocks on lock.
- *
- * Prepare waiter and propagate pi chain
- *
- * This must be called with lock->wait_lock held.
+ * These callbacks are invoked whenever a waiter has changed priority.
+ * So we should requeue it within the lock->wait_list
*/
-static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- int detect_deadlock, unsigned long flags)
+
+static inline int rtmutex_waiter_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct task_struct *owner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *top_waiter = waiter;
- int chain_walk = 0, res;
+ struct rt_mutex_waiter *waiter;

- spin_lock(&current->pi_lock);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
- waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);

- /* Get the top priority waiter on the lock */
- if (rt_mutex_has_waiters(lock))
- top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ /*
+ * We dont need to take any locks here because the
+ * waiter->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ waiter->pi.prio = *src->prio;

- current->pi_blocked_on = waiter;
+ return 0;
+}

- spin_unlock(&current->pi_lock);
+static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
+ unsigned int flags)
+{
+ struct rt_mutex *lock;
+ struct rt_mutex_waiter *waiter;
+ unsigned long iflags;

- if (waiter == rt_mutex_top_waiter(lock)) {
- /* readers are handled differently */
- if (task_is_reader(owner)) {
- res = rt_mutex_adjust_readers(lock, waiter,
- current, lock, 0);
- return res;
- }
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);
+ lock = waiter->lock;

- spin_lock(&owner->pi_lock);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- __rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
- chain_walk = 1;
- spin_unlock(&owner->pi_lock);
+ /*
+ * If waiter->task is non-NULL, it means we are still valid in the
+ * pi list. Therefore, if waiter->pi.prio has changed since we
+ * queued ourselves, requeue it.
+ */
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
}
- else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
- chain_walk = 1;

- if (!chain_walk || task_is_reader(owner))
- return 0;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);
+
+ return 0;
+}
+
+static struct pi_sink rtmutex_waiter_pi_snk = {
+ .boost = rtmutex_waiter_pi_boost,
+ .update = rtmutex_waiter_pi_update,
+};
+
+/*
+ * This must be called with lock->wait_lock held.
+ */
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ int has_waiters = rt_mutex_has_waiters(lock);
+
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+ waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
- * The owner can't disappear while holding a lock,
- * so the owner struct is protected by wait_lock.
- * Gets dropped in rt_mutex_adjust_prio_chain()!
+ * Link the waiter object to the task so that we can adjust our
+ * position on the prio list if the priority is changed. Note
+ * that if the priority races between the time we recorded it
+ * above and the time it is set here, we will correct the race
+ * when we task_pi_update(current) below. Otherwise the
+ * update is a no-op
*/
- get_task_struct(owner);
+ pi_add_sink(&current->pi.node, &waiter->pi.snk,
+ PI_FLAG_DEFER_UPDATE);

- spin_unlock_irqrestore(&lock->wait_lock, flags);
+ /*
+ * Link the lock object to the waiter so that we can form a chain
+ * to the owner
+ */
+ pi_add_sink(&current->pi.node, &lock->pi.node.snk,
+ PI_FLAG_DEFER_UPDATE);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current, 0);
+ /*
+ * If we previously had no waiters, we are transitioning to
+ * a mode where we need to boost the owner
+ */
+ if (!has_waiters) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 1);
+ }

- spin_lock_irq(&lock->wait_lock);
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ /*
+ * We can stop boosting the owner if there are no more waiters
+ */
+ if (!rt_mutex_has_waiters(lock)) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 0);
+ }

- return res;
+ /*
+ * Unlink the lock object from the waiter
+ */
+ pi_del_sink(&p->pi.node, &lock->pi.node.snk, PI_FLAG_DEFER_UPDATE);
+
+ /*
+ * Unlink the waiter object from the task. Note that we
+ * technically do not need an update for "p" because the
+ * .deboost will be processed synchronous to this call
+ * since there is no .deboost handler registered for
+ * the waiter sink
+ */
+ pi_del_sink(&p->pi.node, &waiter->pi.snk, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -566,24 +455,10 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
- struct rt_mutex_waiter *waiter;
- struct task_struct *pendowner;
- struct rt_mutex_waiter *next;
-
- spin_lock(&current->pi_lock);
+ struct rt_mutex_waiter *waiter = rt_mutex_top_waiter(lock);
+ struct task_struct *pendowner = waiter->task;

- waiter = rt_mutex_top_waiter(lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
-
- /*
- * Remove it from current->pi_waiters. We do not adjust a
- * possible priority boost right now. We execute wakeup in the
- * boosted mode and go back to normal after releasing
- * lock->wait_lock.
- */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- pendowner = waiter->task;
- waiter->task = NULL;
+ remove_waiter(lock, waiter);

/*
* Do the wakeup before the ownership change to give any spinning
@@ -621,113 +496,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
}

rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
- spin_unlock(&current->pi_lock);
-
- /*
- * Clear the pi_blocked_on variable and enqueue a possible
- * waiter into the pi_waiters list of the pending owner. This
- * prevents that in case the pending owner gets unboosted a
- * waiter with higher priority than pending-owner->normal_prio
- * is blocked on the unboosted (pending) owner.
- */
-
- if (rt_mutex_has_waiters(lock))
- next = rt_mutex_top_waiter(lock);
- else
- next = NULL;
-
- spin_lock(&pendowner->pi_lock);
-
- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
- pendowner->pi_blocked_on = NULL;
-
- if (next)
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
- spin_unlock(&pendowner->pi_lock);
-}
-
-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long flags)
-{
- int first = (waiter == rt_mutex_top_waiter(lock));
- struct task_struct *owner = rt_mutex_owner(lock);
- int chain_walk = 0;
-
- spin_lock(&current->pi_lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
- current->pi_blocked_on = NULL;
- spin_unlock(&current->pi_lock);
-
- if (first && owner != current && !task_is_reader(owner)) {
-
- spin_lock(&owner->pi_lock);
-
- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
-
- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
- }
- __rt_mutex_adjust_prio(owner);
-
- if (owner->pi_blocked_on)
- chain_walk = 1;
-
- spin_unlock(&owner->pi_lock);
- }
-
- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
- if (!chain_walk)
- return;
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(owner);
-
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0);
-
- spin_lock_irq(&lock->wait_lock);
-}
-
-/*
- * Recheck the pi chain, in case we got a priority setting
- *
- * Called from sched_setscheduler
- */
-void rt_mutex_adjust_pi(struct task_struct *task)
-{
- struct rt_mutex_waiter *waiter;
- unsigned long flags;
-
- spin_lock_irqsave(&task->pi_lock, flags);
-
- waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- return;
- }
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
-
- rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0);
}

/*
@@ -869,7 +637,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* but the lock got stolen by an higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(lock, &waiter, 0, flags);
+ add_waiter(lock, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -917,7 +685,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* can end up with a non-NULL waiter.task:
*/
if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);
/*
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up:
@@ -927,6 +695,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unlock:
spin_unlock_irqrestore(&lock->wait_lock, flags);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -954,8 +725,8 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
@@ -1126,6 +897,9 @@ static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
list_add(&rls->list, &rwm->readers);
+
+ pi_source_init(&rls->pi_src, &rwm->prio);
+ task_pi_boost(rls->task, &rls->pi_src, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -1249,21 +1023,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
waiter = rt_mutex_top_waiter(mutex);
if (!lock_is_stealable(waiter->task, mode))
return 0;
- /*
- * The pending reader has PI waiters,
- * but we are taking the lock.
- * Remove the waiters from the pending owner.
- */
- spin_lock(&mtxowner->pi_lock);
- plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters);
- spin_unlock(&mtxowner->pi_lock);
}
- } else if (rt_mutex_has_waiters(mutex)) {
- /* Readers do things differently with respect to PI */
- waiter = rt_mutex_top_waiter(mutex);
- spin_lock(&current->pi_lock);
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
}
/* Readers never own the mutex */
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
@@ -1275,7 +1035,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
if (incr) {
atomic_inc(&rwm->owners);
rw_check_held(rwm);
- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
rls = &current->owned_read_locks[reader_count];
@@ -1285,10 +1045,11 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();
}
rt_mutex_deadlock_account_lock(mutex, current);
atomic_inc(&rwm->count);
+
return 1;
}

@@ -1378,7 +1139,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1417,7 +1178,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

WARN_ON(rt_mutex_owner(mutex) &&
rt_mutex_owner(mutex) != current &&
@@ -1430,6 +1191,9 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -1457,13 +1221,13 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);
local_irq_save(flags);
- spin_lock(&current->pi_lock);
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
current->owned_read_locks[reader_count].lock = rwm;
current->owned_read_locks[reader_count].count = 1;
} else
WARN_ON_ONCE(1);
+
/*
* If this task is no longer the sole owner of the lock
* or someone is blocking, then we need to add the task
@@ -1473,16 +1237,12 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
struct rt_mutex *mutex = &rwm->mutex;
struct reader_lock_struct *rls;

- /* preserve lock order, we only need wait_lock now */
- spin_unlock(&current->pi_lock);
-
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- rt_rwlock_add_reader(rlw, rwm);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
- } else
- spin_unlock(&current->pi_lock);
+ }
local_irq_restore(flags);
return 1;
}
@@ -1591,7 +1351,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1630,7 +1390,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

/* check on unlock if we have any waiters. */
if (rt_mutex_has_waiters(mutex))
@@ -1642,6 +1402,9 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);

}
@@ -1733,7 +1496,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

for (i = current->reader_lock_count - 1; i >= 0; i--) {
if (current->owned_read_locks[i].lock == rwm) {
- spin_lock(&current->pi_lock);
+ preempt_disable();
current->owned_read_locks[i].count--;
if (!current->owned_read_locks[i].count) {
current->reader_lock_count--;
@@ -1743,9 +1506,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
WARN_ON(!rls->list.prev || list_empty(&rls->list));
list_del_init(&rls->list);
rls->lock = NULL;
+ task_pi_deboost(current, &rls->pi_src,
+ PI_FLAG_DEFER_UPDATE);
rw_check_held(rwm);
}
- spin_unlock(&current->pi_lock);
+ preempt_enable();
break;
}
}
@@ -1776,7 +1541,6 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

/* If no one is blocked, then clear all ownership */
if (!rt_mutex_has_waiters(mutex)) {
- rwm->prio = MAX_PRIO;
/*
* If count is not zero, we are under the limit with
* no other readers.
@@ -1835,28 +1599,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
}

- if (rt_mutex_has_waiters(mutex)) {
- waiter = rt_mutex_top_waiter(mutex);
- rwm->prio = waiter->task->prio;
- /*
- * If readers still own this lock, then we need
- * to update the pi_list too. Readers have a separate
- * path in the PI chain.
- */
- if (reader_count) {
- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->pi_list_entry,
- &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
- }
- } else
- rwm->prio = MAX_PRIO;
-
out:
spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -1874,9 +1621,9 @@ rt_read_fastunlock(struct rw_mutex *rwm,
int reader_count;
int owners;

- spin_lock_irqsave(&current->pi_lock, flags);
+ local_irq_save(flags);
reader_count = --current->reader_lock_count;
- spin_unlock_irqrestore(&current->pi_lock, flags);
+ local_irq_restore(flags);

rt_mutex_deadlock_account_unlock(current);
if (unlikely(reader_count < 0)) {
@@ -1972,17 +1719,7 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- spin_lock(&reader->pi_lock);
- waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);
+ remove_waiter(mutex, waiter);

if (savestate)
wake_up_process_mutex(reader);
@@ -1995,32 +1732,12 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- spin_lock(&pendowner->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-
- /* This could also be a reader (if reader_limit is set) */
- if (next->write_lock)
- /* add back in as top waiter */
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- rwm->prio = next->task->prio;
- } else
- rwm->prio = MAX_PRIO;
-
out:

spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -2068,7 +1785,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);

- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
rls = &current->owned_read_locks[reader_count];
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
@@ -2076,12 +1793,11 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
rls->count = 1;
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();

if (!rt_mutex_has_waiters(mutex)) {
/* We are sole owner, we are done */
rwm->owner = current;
- rwm->prio = MAX_PRIO;
mutex->owner = NULL;
spin_unlock_irqrestore(&mutex->wait_lock, flags);
return;
@@ -2102,17 +1818,8 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&current->pi_lock);
plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
-
- spin_lock(&reader->pi_lock);
waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);

/* downgrade is only for mutexes */
wake_up_process(reader);
@@ -2123,124 +1830,81 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- /* setup this mutex prio for read */
- rwm->prio = next->task->prio;
-
- spin_lock(&current->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
- /* No need to add back since readers don't have PI waiters */
- } else
- rwm->prio = MAX_PRIO;
-
rt_mutex_set_owner(mutex, RT_RW_READER, 0);

spin_unlock_irqrestore(&mutex->wait_lock, flags);
-
- /*
- * Undo pi boosting when necessary.
- * If one of the awoken readers boosted us, we don't want to keep
- * that priority.
- */
- rt_mutex_adjust_prio(current);
-}
-
-void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
-{
- struct rt_mutex *mutex = &rwm->mutex;
-
- rwm->owner = NULL;
- atomic_set(&rwm->count, 0);
- atomic_set(&rwm->owners, 0);
- rwm->prio = MAX_PRIO;
- INIT_LIST_HEAD(&rwm->readers);
-
- __rt_mutex_init(mutex, name);
}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+/*
+ * These callbacks are invoked whenever a rwlock has changed priority.
+ * Since rwlocks maintain their own lists of reader dependencies, we
+ * may need to reboost any readers manually
+ */
+static inline int rt_rwlock_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct reader_lock_struct *rls;
struct rw_mutex *rwm;
- int lock_prio;
- int i;

- for (i = 0; i < task->reader_lock_count; i++) {
- rls = &task->owned_read_locks[i];
- rwm = rls->lock;
- if (rwm) {
- lock_prio = rwm->prio;
- if (prio > lock_prio)
- prio = lock_prio;
- }
- }
+ rwm = container_of(snk, struct rw_mutex, pi_snk);

- return prio;
+ /*
+ * We dont need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ rwm->prio = *src->prio;
+
+ return 0;
}

-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
+static inline int rt_rwlock_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
+ struct rw_mutex *rwm;
+ struct rt_mutex *mutex;
struct reader_lock_struct *rls;
- struct rt_mutex_waiter *waiter;
- struct task_struct *task;
- struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex);
+ unsigned long iflags;

- if (rt_mutex_has_waiters(lock)) {
- waiter = rt_mutex_top_waiter(lock);
- /*
- * Do we need to grab the task->pi_lock?
- * Really, we are only reading it. If it
- * changes, then that should follow this chain
- * too.
- */
- rwm->prio = waiter->task->prio;
- } else
- rwm->prio = MAX_PRIO;
+ rwm = container_of(snk, struct rw_mutex, pi_snk);
+ mutex = &rwm->mutex;

- if (recursion_depth >= MAX_RWLOCK_DEPTH) {
- WARN_ON(1);
- return 1;
- }
+ spin_lock_irqsave(&mutex->wait_lock, iflags);

- list_for_each_entry(rls, &rwm->readers, list) {
- task = rls->task;
- get_task_struct(task);
- /*
- * rt_mutex_adjust_prio_chain will do
- * the put_task_struct
- */
- rt_mutex_adjust_prio_chain(task, 0, orig_lock,
- orig_waiter, top_task,
- recursion_depth+1);
- }
+ list_for_each_entry(rls, &rwm->readers, list)
+ task_pi_boost(rls->task, &rls->pi_src, 0);
+
+ spin_unlock_irqrestore(&mutex->wait_lock, iflags);

return 0;
}
-#else
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
-{
- return 0;
-}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+static struct pi_sink rt_rwlock_pi_snk = {
+ .boost = rt_rwlock_pi_boost,
+ .update = rt_rwlock_pi_update,
+};
+
+void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
{
- return prio;
+ struct rt_mutex *mutex = &rwm->mutex;
+
+ rwm->owner = NULL;
+ atomic_set(&rwm->count, 0);
+ atomic_set(&rwm->owners, 0);
+ rwm->prio = MAX_PRIO;
+ INIT_LIST_HEAD(&rwm->readers);
+
+ __rt_mutex_init(mutex, name);
+
+ /*
+ * Link the rwlock object to the mutex so we get notified
+ * of any priority changes in the future
+ */
+ rwm->pi_snk = rt_rwlock_pi_snk;
+ pi_add_sink(&mutex->pi.node, &rwm->pi_snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
}
+
#endif /* CONFIG_PREEMPT_RT */

static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags)
@@ -2335,8 +1999,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock, flags);
+ ret = add_waiter(lock, &waiter, &flags);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -2374,7 +2037,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -2388,13 +2051,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(timeout))
hrtimer_cancel(&timeout->timer);

- /*
- * Readjust priority, when we did not get the lock. We might
- * have been the pending owner and boosted. Since we did not
- * take the lock, the PI boost has to go.
- */
- if (unlikely(ret))
- rt_mutex_adjust_prio(current);
+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);

/* Must we reaquire the BKL? */
if (unlikely(saved_lock_depth >= 0))
@@ -2457,8 +2115,8 @@ rt_mutex_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

/*
@@ -2654,6 +2312,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
spin_lock_init(&lock->wait_lock);
plist_head_init(&lock->wait_list, &lock->wait_lock);

+ init_pi(lock);
+
debug_rt_mutex_init(lock, name);
}
EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 70df5f5..7bf32d0 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -14,6 +14,7 @@

#include <linux/rtmutex.h>
#include <linux/rt_lock.h>
+#include <linux/pi.h>

/*
* The rtmutex in kernel tester is independent of rtmutex debugging. We
@@ -48,10 +49,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
*/
struct rt_mutex_waiter {
struct plist_node list_entry;
- struct plist_node pi_list_entry;
struct task_struct *task;
struct rt_mutex *lock;
int write_lock;
+ struct {
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
struct pid *deadlock_task_pid;
@@ -79,18 +83,6 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
return w;
}

-static inline int task_has_pi_waiters(struct task_struct *p)
-{
- return !plist_head_empty(&p->pi_waiters);
-}
-
-static inline struct rt_mutex_waiter *
-task_top_pi_waiter(struct task_struct *p)
-{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
-}
-
/*
* lock->owner state tracking:
*/
diff --git a/kernel/rwlock_torture.c b/kernel/rwlock_torture.c
index 2820815..689a0d0 100644
--- a/kernel/rwlock_torture.c
+++ b/kernel/rwlock_torture.c
@@ -682,37 +682,7 @@ static int __init mutex_stress_init(void)

print_owned_read_locks(tsks[i]);

- if (tsks[i]->pi_blocked_on) {
- w = (void *)tsks[i]->pi_blocked_on;
- mtx = w->lock;
- spin_unlock_irq(&tsks[i]->pi_lock);
- spin_lock_irq(&mtx->wait_lock);
- spin_lock(&tsks[i]->pi_lock);
- own = (unsigned long)mtx->owner & ~3UL;
- oops_in_progress++;
- printk("%s:%d is blocked on ",
- tsks[i]->comm, tsks[i]->pid);
- __print_symbol("%s", (unsigned long)mtx);
- if (own == 0x100)
- printk(" owner is READER\n");
- else if (!(own & ~300))
- printk(" owner is ILLEGAL!!\n");
- else if (!own)
- printk(" has no owner!\n");
- else {
- struct task_struct *owner = (void*)own;
-
- printk(" owner is %s:%d\n",
- owner->comm, owner->pid);
- }
- oops_in_progress--;
-
- spin_unlock(&tsks[i]->pi_lock);
- spin_unlock_irq(&mtx->wait_lock);
- } else {
- print_owned_read_locks(tsks[i]);
- spin_unlock_irq(&tsks[i]->pi_lock);
- }
+ spin_unlock_irq(&tsks[i]->pi_lock);
}
}
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 729139d..a373250 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2413,12 +2413,6 @@ task_pi_init(struct task_struct *p)
pi_source_init(&p->pi.src, &p->normal_prio);
task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);

-#ifdef CONFIG_RT_MUTEXES
- p->rtmutex_prio = MAX_PRIO;
- pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
- task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
-#endif
-
/*
* We add our own task as a dependency of ourselves so that
* we get boost-notifications (via task_pi_boost_cb) whenever
@@ -5029,7 +5023,6 @@ task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
goto out_unlock;
}

@@ -5360,7 +5353,6 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);

task_pi_update(p, 0);
- rt_mutex_adjust_pi(p);

return 0;
}
@@ -8494,10 +8486,6 @@ void __init sched_init(void)

task_pi_init(&init_task);

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2008-08-15 12:20:37

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 2/8] sched: add the basic PI infrastructure to the task_struct

This is a first pass at converting the system to use the new PI library.
We don't go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place. Later in the series we will
begin replacing some of the existing subsystem-specific logic with the
generic framework.
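
For illustration only (not part of the patch), here is a minimal sketch of
how a subsystem would contribute priority to a task under the new scheme,
using the pi_source_init()/task_pi_boost()/task_pi_deboost() helpers added
below. The "my_boost" structure and the prio values are hypothetical:

#include <linux/sched.h>	/* task_pi_boost()/task_pi_deboost() */
#include <linux/pi.h>		/* pi_source_init() */

/* hypothetical per-subsystem boost state */
struct my_boost {
	int prio;		/* the priority we want to contribute */
	struct pi_source src;	/* reference to 'prio', handed to the task */
};

static void my_boost_init(struct my_boost *b)
{
	b->prio = MAX_PRIO;
	pi_source_init(&b->src, &b->prio);
}

static void my_boost_task(struct task_struct *p, struct my_boost *b, int prio)
{
	b->prio = prio;
	/*
	 * boost() both registers the source and refreshes its value, so
	 * the same call also reports any later change to b->prio.  This
	 * replaces a direct task_setprio() call in the old scheme.
	 */
	task_pi_boost(p, &b->src, 0);
}

static void my_unboost_task(struct task_struct *p, struct my_boost *b)
{
	/*
	 * Withdraw our contribution; the task recomputes its prio from
	 * the remaining sources (normal_prio, rtmutex, rcu-boost, ...).
	 */
	task_pi_deboost(p, &b->src, 0);
}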

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 37 +++++++--
include/linux/workqueue.h | 2
kernel/fork.c | 1
kernel/rcupreempt-boost.c | 23 +-----
kernel/rtmutex.c | 6 +
kernel/sched.c | 188 ++++++++++++++++++++++++++++++++-------------
kernel/workqueue.c | 39 ++++++++-
7 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..63ddd1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
#include <linux/task_io_accounting.h>
#include <linux/kobject.h>
#include <linux/latencytop.h>
+#include <linux/pi.h>

#include <asm/processor.h>

@@ -1125,6 +1126,7 @@ struct task_struct {
int prio, static_prio, normal_prio;
#ifdef CONFIG_PREEMPT_RCU_BOOST
int rcu_prio;
+ struct pi_source rcu_prio_src;
#endif
const struct sched_class *sched_class;
struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct {
+ struct pi_source src; /* represents normal_prio to 'this' */
+ struct pi_node node;
+ struct pi_sink snk; /* registered to 'this' to get updates */
+ int prio;
+ } pi;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ int rtmutex_prio;
+ struct pi_source rtmutex_prio_src;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
#endif
};

+static inline int
+task_pi_boost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_boost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_deboost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_deboost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_update(struct task_struct *p, unsigned int flags)
+{
+ return pi_update(&p->pi.node, flags);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0)
#else
@@ -1774,14 +1805,8 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-extern void task_setprio(struct task_struct *p, int prio);
-
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
-static inline void rt_mutex_setprio(struct task_struct *p, int prio)
-{
- task_setprio(p, prio);
-}
extern void rt_mutex_adjust_pi(struct task_struct *p);
#else
static inline int rt_mutex_getprio(struct task_struct *p)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 229179e..3dc4ed9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/plist.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -31,6 +32,7 @@ struct work_struct {
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct plist_node entry;
work_func_t func;
+ struct pi_source pi_src;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b49488d..399a0d0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -990,6 +990,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->rcu_flipctr_idx = 0;
#ifdef CONFIG_PREEMPT_RCU_BOOST
p->rcu_prio = MAX_PRIO;
+ pi_source_init(&p->rcu_prio_src, &p->rcu_prio);
p->rcub_rbdp = NULL;
p->rcub_state = RCU_BOOST_IDLE;
INIT_LIST_HEAD(&p->rcub_entry);
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 5282b19..e8d9d76 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -232,14 +232,11 @@ static inline int rcu_is_boosted(struct task_struct *task)
static void rcu_boost_task(struct task_struct *task)
{
WARN_ON(!irqs_disabled());
- WARN_ON_SMP(!spin_is_locked(&task->pi_lock));

rcu_trace_boost_task_boost_called(RCU_BOOST_ME);

- if (task->rcu_prio < task->prio) {
+ if (task_pi_boost(task, &task->rcu_prio_src, 0))
rcu_trace_boost_task_boosted(RCU_BOOST_ME);
- task_setprio(task, task->rcu_prio);
- }
}

/**
@@ -275,26 +272,17 @@ void __rcu_preempt_boost(void)
rbd = &__get_cpu_var(rcu_boost_data);
spin_lock(&rbd->rbs_lock);

- spin_lock(&curr->pi_lock);
-
curr->rcub_rbdp = rbd;

rcu_trace_boost_try_boost(rbd);

- prio = rt_mutex_getprio(curr);
-
if (list_empty(&curr->rcub_entry))
list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost);
- if (prio <= rbd->rbs_prio)
- goto out;
-
- rcu_trace_boost_boosted(curr->rcub_rbdp);

set_rcu_prio(curr, rbd->rbs_prio);
rcu_boost_task(curr);

out:
- spin_unlock(&curr->pi_lock);
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}

@@ -353,15 +341,12 @@ void __rcu_preempt_unboost(void)

rcu_trace_boost_unboosted(rbd);

- set_rcu_prio(curr, MAX_PRIO);
+ task_pi_deboost(curr, &curr->rcu_prio_src, 0);

- spin_lock(&curr->pi_lock);
- prio = rt_mutex_getprio(curr);
- task_setprio(curr, prio);
+ set_rcu_prio(curr, MAX_PRIO);

curr->rcub_rbdp = NULL;

- spin_unlock(&curr->pi_lock);
out:
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}
@@ -393,9 +378,7 @@ static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned lon
list_move_tail(&p->rcub_entry,
&rbd->rbs_boosted);
set_rcu_prio(p, prio);
- spin_lock(&p->pi_lock);
rcu_boost_task(p);
- spin_unlock(&p->pi_lock);

/*
* Now we release the lock to allow for a higher
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 377949a..7d11380 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -178,8 +178,10 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
- rt_mutex_setprio(task, prio);
+ if (task->rtmutex_prio != prio) {
+ task->rtmutex_prio = prio;
+ task_pi_boost(task, &task->rtmutex_prio_src, 0);
+ }
}

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 54ea580..c129b10 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1709,26 +1709,6 @@ static inline int normal_prio(struct task_struct *p)
}

/*
- * Calculate the current priority, i.e. the priority
- * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
- */
-static int effective_prio(struct task_struct *p)
-{
- p->normal_prio = normal_prio(p);
- /*
- * If we are RT tasks or we were boosted to RT priority,
- * keep the priority unchanged. Otherwise, update priority
- * to the normal priority:
- */
- if (!rt_prio(p->prio))
- return p->normal_prio;
- return p->prio;
-}
-
-/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -2375,6 +2355,58 @@ static void __sched_fork(struct task_struct *p)
p->state = TASK_RUNNING;
}

+static int
+task_pi_boost_cb(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+
+ /*
+ * We don't need any locking here, since the .boost operation
+ * is already guaranteed to be mutually exclusive
+ */
+ p->pi.prio = *src->prio;
+
+ return 0;
+}
+
+static int task_pi_update_cb(struct pi_sink *snk, unsigned int flags);
+
+static struct pi_sink task_pi_sink = {
+ .boost = task_pi_boost_cb,
+ .update = task_pi_update_cb,
+};
+
+static inline void
+task_pi_init(struct task_struct *p)
+{
+ pi_node_init(&p->pi.node);
+
+ /*
+ * Feed our initial state of normal_prio into the PI infrastructure.
+ * We will update this whenever it changes
+ */
+ p->pi.prio = p->normal_prio;
+ pi_source_init(&p->pi.src, &p->normal_prio);
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+
+#ifdef CONFIG_RT_MUTEXES
+ p->rtmutex_prio = MAX_PRIO;
+ pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
+ task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
+#endif
+
+ /*
+ * We add our own task as a dependency of ourselves so that
+ * we get boost-notifications (via task_pi_boost_cb) whenever
+ * our priority is changed (locally e.g. setscheduler() or
+ * remotely via a pi-boost).
+ */
+ p->pi.snk = task_pi_sink;
+ pi_add_sink(&p->pi.node, &p->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -2396,6 +2428,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

+ task_pi_init(p);
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -2411,6 +2445,55 @@ void sched_fork(struct task_struct *p, int clone_flags)
}

/*
+ * In the past, task_setprio was exposed as an API. This variant is only
+ * meant to be called from pi_update functions (namely, task_updateprio() and
+ * task_pi_update_cb()). If you need to adjust the priority of a task,
+ * you should be using something like setscheduler() (permanent adjustments)
+ * or task_pi_boost() (temporary adjustments).
+ */
+static void
+task_setprio(struct task_struct *p, int prio)
+{
+ if (prio == p->prio)
+ return;
+
+ if (rt_prio(prio))
+ p->sched_class = &rt_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+
+ p->prio = prio;
+}
+
+static inline void
+task_updateprio(struct task_struct *p)
+{
+ int prio = normal_prio(p);
+
+ if (p->normal_prio != prio) {
+ p->normal_prio = prio;
+ set_load_weight(p);
+
+ /*
+ * Reboost our normal_prio entry, which will
+ * also chain-update any of our PI dependencies (of course)
+ * on our next update
+ */
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+ }
+
+ /*
+ * If normal_prio is logically higher than our current setting,
+ * just assign the priority/class immediately so that any callers
+ * will see the update as synchronous without dropping the rq-lock
+ * to do a pi_update. Any discrepancy with pending pi-updates will
+ * automatically be corrected after we drop the rq-lock.
+ */
+ if (p->normal_prio < p->prio)
+ task_setprio(p, p->normal_prio);
+}
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -2426,7 +2509,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);

- p->prio = effective_prio(p);
+ task_updateprio(p);

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2447,6 +2530,8 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_wake_up(rq, p);
#endif
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -4887,27 +4972,25 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
EXPORT_SYMBOL(sleep_on_timeout);

/*
- * task_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * Invoked whenever our priority changes by the PI library
*
* This function changes the 'effective' priority of a task. It does
* not touch ->normal_prio like __setscheduler().
*
- * Used by the rt_mutex code to implement priority inheritance logic
- * and by rcupreempt-boost to boost priorities of tasks sleeping
- * with rcu locks.
*/
-void task_setprio(struct task_struct *p, int prio)
+static int
+task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
{
- unsigned long flags;
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+ unsigned long iflags;
int oldprio, on_rq, running;
+ int prio = p->pi.prio;
struct rq *rq;
const struct sched_class *prev_class = p->sched_class;

BUG_ON(prio < 0 || prio > MAX_PRIO);

- rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &iflags);

/*
* Idle task boosting is a nono in general. There is one
@@ -4929,6 +5012,10 @@ void task_setprio(struct task_struct *p, int prio)

update_rq_clock(rq);

+ /* If prio is not changing, bail */
+ if (prio == p->prio)
+ goto out_unlock;
+
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -4937,12 +5024,7 @@ void task_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
- p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
-
- p->prio = prio;
+ task_setprio(p, prio);

// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p));

@@ -4956,7 +5038,9 @@ void task_setprio(struct task_struct *p, int prio)
// trace_special(prev_resched, _need_resched(), 0);

out_unlock:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, &iflags);
+
+ return 0;
}

void set_user_nice(struct task_struct *p, long nice)
@@ -4990,9 +5074,9 @@ void set_user_nice(struct task_struct *p, long nice)
}

p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p);
old_prio = p->prio;
- p->prio = effective_prio(p);
+ task_updateprio(p);
+
delta = p->prio - old_prio;

if (on_rq) {
@@ -5007,6 +5091,8 @@ void set_user_nice(struct task_struct *p, long nice)
}
out_unlock:
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}
EXPORT_SYMBOL(set_user_nice);

@@ -5123,23 +5209,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
BUG_ON(p->se.on_rq);

p->policy = policy;
- switch (p->policy) {
- case SCHED_NORMAL:
- case SCHED_BATCH:
- case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
- break;
- case SCHED_FIFO:
- case SCHED_RR:
- p->sched_class = &rt_sched_class;
- break;
- }
-
p->rt_priority = prio;
- p->normal_prio = normal_prio(p);
- /* we are holding p->pi_lock already */
- p->prio = rt_mutex_getprio(p);
- set_load_weight(p);
+
+ task_updateprio(p);
}

/**
@@ -5264,6 +5336,7 @@ recheck:
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);

+ task_pi_update(p, 0);
rt_mutex_adjust_pi(p);

return 0;
@@ -6686,6 +6759,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
deactivate_task(rq, rq->idle, 0);
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
+ rq->idle->prio = rq->idle->normal_prio;
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
@@ -8395,6 +8469,8 @@ void __init sched_init(void)
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
#endif

+ task_pi_init(&init_task);
+
#ifdef CONFIG_RT_MUTEXES
plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
#endif
@@ -8460,7 +8536,9 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
+
__setscheduler(rq, p, SCHED_NORMAL, 0);
+
if (on_rq) {
activate_task(rq, p, 0);
resched_task(rq->curr);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f37979..5cd4b0e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -145,8 +145,13 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
plist_node_init(&work->entry, prio);
plist_add(&work->entry, &cwq->worklist);

- if (boost_prio < cwq->thread->prio)
- task_setprio(cwq->thread, boost_prio);
+ /*
+ * FIXME: We want to boost to boost_prio, but we don't record that
+ * value in the work_struct for later deboosting
+ */
+ pi_source_init(&work->pi_src, &work->entry.prio);
+ task_pi_boost(cwq->thread, &work->pi_src, 0);
+
wake_up(&cwq->more_work);
}

@@ -280,6 +285,10 @@ struct wq_barrier {
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
struct plist_head *worklist = &cwq->worklist;
+ struct pi_source pi_src;
+ int prio;
+
+ pi_source_init(&pi_src, &prio);

spin_lock_irq(&cwq->lock);
cwq->run_depth++;
@@ -292,10 +301,10 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)

again:
while (!plist_head_empty(worklist)) {
- int prio;
struct work_struct *work = plist_first_entry(worklist,
struct work_struct, entry);
work_func_t f = work->func;
+
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct
@@ -316,14 +325,28 @@ again:
}
prio = max(prio, 0);

- if (likely(cwq->thread->prio != prio))
- task_setprio(cwq->thread, prio);
-
cwq->current_work = work;
plist_del(&work->entry, worklist);
plist_node_init(&work->entry, MAX_PRIO);
spin_unlock_irq(&cwq->lock);

+ /*
+ * The owner is free to reuse the work object once we execute
+ * the work->func() below. Therefore we cannot leave the
+ * work->pi_src boosting our thread or it may get stomped
+ * on when the work item is requeued.
+ *
+ * So what we do is boost ourselves with an on-the-stack
+ * copy of the priority of the work item, and then
+ * deboost the work item. Once the work is complete, we
+ * can then simply deboost the stack version.
+ *
+ * Note that this will not typically cause a pi-chain
+ * update since we are boosting the node laterally
+ */
+ task_pi_boost(current, &pi_src, PI_FLAG_DEFER_UPDATE);
+ task_pi_deboost(current, &work->pi_src, PI_FLAG_DEFER_UPDATE);
+
BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
leak_check(NULL);
@@ -334,6 +357,9 @@ again:
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);
leak_check(f);

+ /* Deboost the stack copy of the work->prio (see above) */
+ task_pi_deboost(current, &pi_src, 0);
+
spin_lock_irq(&cwq->lock);
cwq->current_work = NULL;
wake_up_all(&cwq->work_done);
@@ -357,7 +383,6 @@ again:
goto again;
}

- task_setprio(cwq->thread, current->normal_prio);
cwq->run_depth--;
spin_unlock_irq(&cwq->lock);
}

2008-08-15 12:21:13

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 5/8] RT: wrap the rt_rwlock "add reader" logic

We will use this later in the series to add PI functions on "add".
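
To make the intent concrete, here is a hedged sketch of the kind of hook
this wrapper is meant to host later in the series; the
rt_rwlock_pi_boost_reader() helper named below is hypothetical and not
part of this patch:

static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
	list_add(&rls->list, &rwm->readers);

	/*
	 * Hypothetical follow-on (see the later rtmutex/libpi patches):
	 * once the reader is on the list, a pi_source could be registered
	 * for it so that blocked writers can boost this reader through
	 * the libpi chain, e.g.:
	 *
	 *	rt_rwlock_pi_boost_reader(rwm, rls);
	 */
}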

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
# define rw_check_held(rwm) do { } while (0)
#endif

+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+ list_add(&rls->list, &rwm->readers);
+}
+
/*
* The fast path does not add itself to the reader list to keep
* from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
if (rls->list.prev && !list_empty(&rls->list))
return;

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);

/* change to reader, so no one else updates too */
rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
* it hasn't been added to the link list yet.
*/
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
rls->count++;
incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rls->lock = rwm;
rls->count = 1;
WARN_ON(rls->list.prev && !list_empty(&rls->list));
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
} else
spin_unlock(&current->pi_lock);
@@ -2083,7 +2089,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)

/* Set us up for multiple readers or conflicts */

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rwm->owner = RT_RW_READER;

/*

2008-08-15 12:21:32

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 0/8] Priority Inheritance enhancements

** RFC for PREEMPT_RT branch, 26-rt1 **

Synopsis: We gain a 13%+ IO improvement in the PREEMPT_RT kernel by
re-working some of the PI logic.

[
pi-enhancements v2

Changes since v1:

*) Added proper reference counting to prevent tasks from
being deleted while a node->update() is still in flight
*) unified the RCU boost path
]


[
fyi -> you can find this series at the following URLs in
addition to this thread:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=shortlog;h=pi-rework

ftp://ftp.novell.com/dev/ghaskins/pi-rework-v2.tar.bz2

]

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches. This yields at
least a 13-15% improvement for diskio on my 4-way x86_64 system. An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
(at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
of their non-rt equivalents. We have determined via instrumentation
that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined in the rtmutex infrastructure,
yet we require more flexibility if we want to address (1) and (2)
above. Therefore the first step is to separate the PI code away from
rtmutex into its own library (libpi). This is covered in patches 1-7.

(I realize patch #7 is a little hard to review since I removed and added
a lot of code, which the unified diff mashes together... I will try
to find a way to make this more readable).

Patch 8 is the first real consumer of the libpi logic to try to enhance
performance. It accomplishes this by deferring pi-boosting a lock
owner unless it is absolutely necessary. Since instrumentation
shows that the majority of locks are acquired either via the fast-path,
or via the adaptive-spin path, we can eliminate most of the pi-overhead
with this technique. This yields a measurable performance gain (at least
13% for workloads with heavy lock contention was observed in our lab).
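
For readers skimming the cover letter, the deferral in patch 8 amounts to
something like the following conceptual sketch (the helper names here are
placeholders for illustration, not the functions the patch actually adds):

	/* slow path of a contended rt lock, conceptually */
	while (!try_to_take_lock(lock)) {
		if (owner_still_running(lock))
			continue;	/* adaptive-spin: no pi activity */
		/*
		 * Only now, when we are really going to sleep, do we
		 * pi-boost the lock owner.  Fast-path and adaptive-spin
		 * acquisitions therefore never touch the pi code at all.
		 */
		pi_boost_owner(lock);
		schedule();
	}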

We have not yet completed the work on the pi-waitqueues or any of the other
related pi enhancements. Those will be coming in a follow-on announcement.

Feedback/comments welcome!

Regards,
-Greg


---

Gregory Haskins (8):
rtmutex: pi-boost locks as late as possible
rtmutex: convert rtmutexes to fully use the PI library
rtmutex: use runtime init for rtmutexes
RT: wrap the rt_rwlock "add reader" logic
rtmutex: formally initialize the rt_mutex_waiters
sched: rework task reference counting to work with the pi infrastructure
sched: add the basic PI infrastructure to the task_struct
add generalized priority-inheritance interface


Documentation/libpi.txt | 59 ++
include/linux/pi.h | 278 +++++++++++
include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 18 -
include/linux/sched.h | 57 +-
include/linux/workqueue.h | 2
kernel/fork.c | 35 +
kernel/rcupreempt-boost.c | 25 -
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 1091 ++++++++++++++++++---------------------------
kernel/rtmutex_common.h | 19 -
kernel/rwlock_torture.c | 32 -
kernel/sched.c | 209 ++++++---
kernel/workqueue.c | 39 +-
lib/Makefile | 3
lib/pi.c | 516 +++++++++++++++++++++
17 files changed, 1543 insertions(+), 850 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c


2008-08-15 12:20:51

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 3/8] sched: rework task reference counting to work with the pi infrastructure
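
[ Since this patch has no changelog text, here is a rough sketch of the
  teardown flow it creates, pieced together from the hunks below and the
  libpi free semantics elsewhere in this thread; treat the exact callback
  ordering as a reviewer's reading of the code, not as authoritative: ]

/*
 * put_task_struct(t)                       last task reference dropped
 *   -> pi_dropref(&t->pi.node, 0)          node refcount hits zero
 *      -> the node performs an implicit del_sink() on its remaining
 *         sinks, including the task's own pi.snk from task_pi_init()
 *         -> task_pi_free_cb()             the .free op of the task sink
 *            -> call_rcu(&p->pi.rcu, task_pi_free_rcu)
 *               -> prepare_free_task(tsk)  old __put_task_struct() body
 *                  -> free_task(tsk)
 *
 * On PREEMPT_RT, put_task_struct() defers through call_rcu() and
 * __put_task_struct_cb() before reaching the pi_dropref() step.
 */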

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 5 +++--
kernel/fork.c | 32 +++++++++++++++-----------------
kernel/sched.c | 23 +++++++++++++++++++++++
3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63ddd1f..9132b42 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct task_struct {
struct pi_node node;
struct pi_sink snk; /* registered to 'this' to get updates */
int prio;
+ struct rcu_head rcu; /* for destruction cleanup */
+
} pi;

#ifdef CONFIG_RT_MUTEXES
@@ -1633,12 +1635,11 @@ static inline void put_task_struct(struct task_struct *t)
call_rcu(&t->rcu, __put_task_struct_cb);
}
#else
-extern void __put_task_struct(struct task_struct *t);

static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ pi_dropref(&t->pi.node, 0);
}
#endif

diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..399a0a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -130,39 +130,37 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);

-#ifdef CONFIG_PREEMPT_RT
-void __put_task_struct_cb(struct rcu_head *rhp)
+void prepare_free_task(struct task_struct *tsk)
{
- struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
-
BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(!tsk->exit_state);
WARN_ON(tsk == current);

+#ifdef CONFIG_PREEMPT_RT
+ WARN_ON(!tsk->exit_state);
+#else
+ WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
+#endif
+
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);
+
+#ifdef CONFIG_PREEMPT_RT
delayacct_tsk_free(tsk);
+#endif

if (!profile_handoff_task(tsk))
free_task(tsk);
}

-#else
-
-void __put_task_struct(struct task_struct *tsk)
+#ifdef CONFIG_PREEMPT_RT
+void __put_task_struct_cb(struct rcu_head *rhp)
{
- WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
- BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(tsk == current);
-
- security_task_free(tsk);
- free_uid(tsk->user);
- put_group_info(tsk->group_info);
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ pi_dropref(&tsk->pi.node, 0);
}
+
#endif

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index c129b10..729139d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2370,11 +2370,34 @@ task_pi_boost_cb(struct pi_sink *snk, struct pi_source *src,
return 0;
}

+extern void prepare_free_task(struct task_struct *tsk);
+
+static void task_pi_free_rcu(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, pi.rcu);
+
+ prepare_free_task(tsk);
+}
+
+/*
+ * This function is invoked whenever the last references to a task have
+ * been dropped, and we should free the memory on the next rcu grace period
+ */
+static int task_pi_free_cb(struct pi_sink *snk, unsigned int flags)
+{
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+
+ call_rcu(&p->pi.rcu, task_pi_free_rcu);
+
+ return 0;
+}
+
static int task_pi_update_cb(struct pi_sink *snk, unsigned int flags);

static struct pi_sink task_pi_sink = {
.boost = task_pi_boost_cb,
.update = task_pi_update_cb,
+ .free = task_pi_free_cb,
};

static inline void

2008-08-15 12:29:05

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 6/8] rtmutex: use runtime init for rtmutexes

The system already has facilities to perform late/run-time init for
rtmutexes. We want to add more advanced initialization later in the
series so we force all rtmutexes through the init path in preparation
for the later patches.
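
For reference, the runtime path that now carries the full initialization
is __rt_mutex_init() (abridged sketch based on the hunk shown elsewhere in
this thread; init_pi() only appears once the later libpi conversion is
applied):

void __rt_mutex_init(struct rt_mutex *lock, const char *name)
{
	spin_lock_init(&lock->wait_lock);
	plist_head_init(&lock->wait_list, &lock->wait_lock);

	init_pi(lock);		/* added by the later libpi conversion */

	debug_rt_mutex_init(lock, name);
}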

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rtmutex.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
- , .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

#define DEFINE_RT_MUTEX(mutexname) \

2008-08-15 13:19:18

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v3] add generalized priority-inheritance interface

[
of course, 2 seconds after I hit "send" on v2 I realized there was a
race condition in libpi w.r.t. the sinkref->prio. Rather than spam
you guys with a full "v3" refresh of the series, here is a fixed
version of patch 1/8 which constitutes "v3" when used with patches
2-8 from v2.
]

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concepts of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.
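
As a quick orientation for reviewers, here is a minimal, hypothetical
consumer of the interface (against the v3 "snk" naming used below; the
demo_* names are made up and not part of the patch): a leaf sink that
simply records the node's current priority as it is pushed down.

#include <linux/kernel.h>	/* container_of() */
#include <linux/pi.h>

/* hypothetical leaf sink: remembers the last priority pushed at it */
struct demo_sink {
	struct pi_sink snk;
	int prio;
};

static int demo_boost(struct pi_sink *snk, struct pi_source *src,
		      unsigned int flags)
{
	struct demo_sink *d = container_of(snk, struct demo_sink, snk);

	d->prio = *src->prio;	/* sources are references, so dereference */
	return 0;
}

static int demo_update(struct pi_sink *snk, unsigned int flags)
{
	/* act on the new priority here (requeue, reschedule, ...) */
	return 0;
}

static struct demo_sink demo = {
	.snk = { .boost = demo_boost, .update = demo_update },
};

static struct pi_node node;
static int my_prio = 50;		/* arbitrary example priority */
static struct pi_source my_src;

static void demo_usage(void)
{
	pi_node_init(&node);

	/* hang the leaf sink off the node */
	pi_add_sink(&node, &demo.snk, 0);

	/*
	 * Contribute a priority to the node; the node propagates it to
	 * its sinks via their ->boost()/->update() ops on the update.
	 */
	pi_source_init(&my_src, &my_prio);
	pi_boost(&node, &my_src, 0);

	/* ... and later withdraw the contribution again */
	pi_deboost(&node, &my_src, 0);
	pi_update(&node, 0);
}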

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 +++++
include/linux/pi.h | 277 ++++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 509 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 847 insertions(+), 1 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concepts of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boosting and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for updating a chain of sources and sinks in the proper order. If we passed the priority by value, the order in which the chain executes could allow the value that is ultimately set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly-destructable (meaning they do not require an explicit “free()” operation when they are not used anymore). This is important considering their intended use (spinlock_t's which are also implicitly-destructable). As such, any allocations needed for operation must come from internal structure storage as there will be no opportunity to free it later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason why we added the ability to add more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add flexible notification mechanisms that are peripheral to the chain, which are informally called “leaf sinks”.
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+ --------- ---------
+ | leaf | | leaf |
+ --------- ---------
+ / /
+ --------- / ---------- / --------- ---------
+ ->-| node |->---| node |-->---| node |->---| leaf |
+ --------- ---------- --------- ---------
+
+The reason why this was done was to unify the notion of a “sink” to a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. I'm a task, I'm a lock, I'm a node, I'm a work-item, I'm a wait-queue, etc.).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one node to many source relationship. Therefore, if a pi_node is to act as an aggregator to multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next. But, long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..ed1fcf0
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,277 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+#define PI_FLAG_NO_DROPREF (1 << 2)
+
+struct pi_sink {
+ atomic_t refs;
+ int (*boost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *snk,
+ unsigned int flags);
+ int (*free)(struct pi_sink *snk,
+ unsigned int flags);
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *snk;
+ struct pi_source src;
+ atomic_t refs;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+ int count;
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink snk;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head snks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @snk: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an initial boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @snk: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->boost)
+ return snk->boost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to deboost from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->deboost)
+ return snk->deboost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->update)
+ return snk->update(snk, flags);
+
+ return 0;
+}
+
+/**
+ * pi_sink_dropref - down the reference count, freeing the sink if 0
+ * @snk: the sink context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_sink_dropref(struct pi_sink *snk, unsigned int flags)
+{
+ if (atomic_dec_and_test(&snk->refs)) {
+ if (snk->free)
+ snk->free(snk, flags);
+ }
+}
+
+
+/**
+ * pi_addref - up the reference count
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_addref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ atomic_inc(&snk->refs);
+}
+
+/**
+ * pi_dropref - down the reference count, freeing the node if 0
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_dropref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ pi_sink_dropref(snk, flags);
+}
+
+#endif /* _LINUX_PI_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..74e4dad
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,509 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+ pool->count = 0;
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ BUG_ON(!pool->count);
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+ pool->count--;
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_addref(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->snk->refs);
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_dropref_local(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ spin_unlock(&node->lock);
+ }
+}
+
+static inline void
+_pi_sink_dropref_all(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+
+ _pi_sink_dropref_local(node, sinkref);
+ pi_sink_dropref(snk, 0);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *snk)
+{
+ return container_of(snk, struct pi_node, snk);
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+}
+
+/*
+ * _pi_node_update - update the chain
+ *
+ * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
+ * that need to propagate up the chain. This is a step-wise process where we
+ * have to be careful about locking and preemption. By trying MAX_PI_DEPs
+ * times, we guarantee that this update routine is an effective barrier...
+ * all modifications made prior to the call to this barrier will have completed.
+ *
+ * Deadlock avoidance: This node may participate in a chain of nodes which
+ * form a graph of arbitrary structure. While the graph should technically
+ * never close on itself barring any bugs, we still want to protect against
+ * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
+ * from detecting this potential). To do this, we employ a dual-locking
+ * scheme where we can carefully control the order. That is: node->lock
+ * protects most of the node's internal state, but it will never be held
+ * across a chain update. sinkref->lock, on the other hand, can be held
+ * across a boost/deboost, and also guarantees proper execution order. Also
+ * note that no locks are held across an snk->update.
+ */
+static int
+_pi_node_update(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ unsigned long iflags;
+ int count = 0;
+ int i;
+ int pprio;
+
+ struct updater {
+ int update;
+ struct pi_sinkref *sinkref;
+ struct pi_sink *snk;
+ } updaters[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ pprio = node->prio;
+
+ if (!plist_head_empty(&node->srcs))
+ node->prio = plist_first(&node->srcs)->prio;
+ else
+ node->prio = MAX_PRIO;
+
+ list_for_each_entry(sinkref, &node->snks, list) {
+ /*
+ * If the priority is changing, or if this is a
+ * BOOST/DEBOOST, we consider this sink "stale"
+ */
+ if (pprio != node->prio
+ || sinkref->state != pi_state_boosted) {
+ struct updater *iter = &updaters[count++];
+
+ BUG_ON(!atomic_read(&sinkref->snk->refs));
+ _pi_sink_addref(sinkref);
+
+ iter->update = 1;
+ iter->sinkref = sinkref;
+ iter->snk = sinkref->snk;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ unsigned int lflags = PI_FLAG_DEFER_UPDATE;
+ struct pi_sink *snk;
+
+ sinkref = iter->sinkref;
+ snk = iter->snk;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ snk->boost(snk, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ snk->deboost(snk, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * the above.
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ break;
+ case pi_state_free:
+ iter->update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ /*
+ * We will drop the sinkref reference while still holding the
+ * preempt/irqs off so that the memory is returned synchronously
+ * to the system.
+ */
+ _pi_sink_dropref_local(node, sinkref);
+
+ /*
+ * The sinkref is no longer valid since we dropped the reference
+ * above, so symbolically drop it here too to make it more
+ * obvious if we try to use it later
+ */
+ iter->sinkref = NULL;
+ }
+
+ local_irq_restore(iflags);
+
+ /*
+ * Note: At this point, sinkref is invalid since we dropref'd
+ * it above, but snk is valid since we still hold the remote
+ * reference. This is key to the design because it allows us
+ * to synchronously free the sinkref object, yet maintain a
+ * reference to the sink across the update
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ if (iter->update)
+ iter->snk->update(iter->snk, 0);
+ }
+
+ /*
+ * We perform all the free operations together at the end, using
+ * only automatic/stack variables since any one of these operations
+ * could result in our node object being deallocated
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ pi_sink_dropref(iter->snk, 0);
+ }
+
+ return 0;
+}
+
+static void
+_pi_del_sinkref(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+ int remove = 0;
+ unsigned long iflags;
+
+ local_irq_save(iflags);
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (snk->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * when the caller performed the lookup
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ _pi_sink_dropref_local(node, sinkref);
+ local_irq_restore(iflags);
+
+ pi_sink_dropref(snk, 0);
+}
+
+static int
+_pi_node_boost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_free(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * When the node is freed, we should perform an implicit
+ * del_sink on any remaining sinks we may have
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+static struct pi_sink pi_node_snk = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+ .free = _pi_node_free,
+};
+
+void pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ node->snk = pi_node_snk;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->snks);
+ plist_head_init(&node->srcs, &node->lock);
+ atomic_set(&node->snk.refs, 1);
+}
+
+int pi_add_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!atomic_read(&node->snk.refs)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ pi_source_init(&sinkref->src, &node->prio);
+ sinkref->snk = snk;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_inc(&sinkref->snk->refs);
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->snks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int pi_del_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * There may be multiple matches to snk because sometimes a
+ * deboost/free may still be pending an update when the same
+ * sink has been re-added. So we want to process all instances
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ if (sinkref->snk == snk) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+
+

2008-08-15 20:30:42

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 0/8] Priority Inheritance enhancements

** RFC for PREEMPT_RT branch, 26-rt1 **

Synopsis: We gain a 13%+ IO improvement in the PREEMPT_RT kernel by
re-working some of the PI logic.

[
Changelog:

v4:

1) Incorporated review comments

*) Fixed checkpatch warning about extern in .c
*) Renamed s/snk/sink
*) Renamed s/addref/get
*) Renamed s/dropref/put
*) Made pi_sink use static *ops

2) Fixed a bug w.r.t. enabling interrupts too early

v3:

*) fixed a race with sinkref->prio

v2:

*) Added proper reference counting to prevent tasks from
being deleted while a node->update() is still in flight
*) unified the RCU boost path

v1:

*) initial release
]


[
fyi -> you can find this series at the following URLs in
addition to this thread:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=shortlog;h=pi-rework

ftp://ftp.novell.com/dev/ghaskins/pi-rework.tar.bz2

]

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches. This yields at
least a 13-15% improvement for diskio on my 4-way x86_64 system. An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
(at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
of their non-rt equivalent. We have determined via instrumentation
that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined with the rtmutex infrastructure,
yet we require more flexibility if we want to address (1) and (2)
above. Therefore the first step is to separate the PI code from
rtmutex into its own library (libpi). This is covered in patches 1-7.

(I realize patch #7 is a little hard to review since I removed and added
a lot of code, which the unified diff mashes together... I will try
to find a way to make this more readable.)

Patch 8 is the first real consumer of the libpi logic to try to enhance
performance. It accomplishes this by deferring pi-boosting a lock
owner unless it is absolutely necessary. Since instrumentation
shows that the majority of locks are acquired either via the fast-path,
or via the adaptive-spin path, we can eliminate most of the pi-overhead
with this technique. This yields a measurable performance gain (at least
13% was observed in our lab for workloads with heavy lock contention).
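
To make the idea concrete, here is a rough sketch (not the actual patch 8
code) of where the deferral happens: the waiter only pays for a pi-boost
once the cheap acquisition paths have failed and it is committed to
sleeping. try_fast_acquire() and adaptive_spin() are stand-ins for the
existing fast/adaptive paths, and task_pi_boost() is the helper added
later in this series.

	/*
	 * Illustrative sketch only: defer pi-boosting the lock owner until
	 * we know we are actually going to block on the lock.
	 */
	static void deferred_boost_sketch(struct rt_mutex *lock,
					  struct pi_source *my_prio_src)
	{
		if (try_fast_acquire(lock))	/* uncontended: no PI overhead */
			return;

		if (adaptive_spin(lock))	/* owner is running: spin, no PI overhead */
			return;

		/* Only now, just before blocking, boost the owner */
		task_pi_boost(rt_mutex_owner(lock), my_prio_src, 0);
		schedule();
	}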

We have not yet completed the work on the pi-waitqueues or any of the other
related pi enhancements. Those will be coming in a follow-on announcement.

Feedback/comments welcome!

Regards,
-Greg


---

Gregory Haskins (8):
rtmutex: pi-boost locks as late as possible
rtmutex: convert rtmutexes to fully use the PI library
rtmutex: use runtime init for rtmutexes
RT: wrap the rt_rwlock "add reader" logic
rtmutex: formally initialize the rt_mutex_waiters
sched: rework task reference counting to work with the pi infrastructure
sched: add the basic PI infrastructure to the task_struct
add generalized priority-inheritance interface


Documentation/libpi.txt | 59 ++
include/linux/pi.h | 293 ++++++++++++
include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 18 -
include/linux/sched.h | 59 +-
include/linux/workqueue.h | 2
kernel/fork.c | 35 +
kernel/rcupreempt-boost.c | 25 -
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 1091 ++++++++++++++++++---------------------------
kernel/rtmutex_common.h | 19 -
kernel/rwlock_torture.c | 32 -
kernel/sched.c | 207 ++++++---
kernel/workqueue.c | 39 +-
lib/Makefile | 3
lib/pi.c | 489 ++++++++++++++++++++
17 files changed, 1531 insertions(+), 850 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

--
Signature

2008-08-15 20:30:58

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.
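
As a quick taste of the interface (a hypothetical consumer sketch; the
library itself follows below), a minimal leaf sink can be registered on a
node and then boosted through a pi_source. Note that a source is
initialized with a pointer to a priority value, which is what makes a
repeated pi_boost() an idempotent refresh rather than a new operation.
The "my_leaf"/"waiter" names are illustrative only.

	#include <linux/sched.h>	/* MAX_PRIO */
	#include <linux/pi.h>

	/* Hypothetical leaf sink: it simply records the node's current priority */
	static int leaf_prio = MAX_PRIO;

	static int my_leaf_boost(struct pi_sink *sink, struct pi_source *src,
				 unsigned int flags)
	{
		leaf_prio = *src->prio;	/* sources carry a reference, not a value */
		return 0;
	}

	static int my_leaf_update(struct pi_sink *sink, unsigned int flags)
	{
		return 0;		/* nothing deferred in this toy sink */
	}

	static struct pi_sink_ops my_leaf_ops = {
		.boost	= my_leaf_boost,
		.update	= my_leaf_update,
	};

	static struct pi_node node;
	static struct pi_sink my_leaf;
	static int waiter_prio = 40;	/* 0 = highest, MAX_PRIO = lowest */
	static struct pi_source waiter_src;

	static void example(void)
	{
		pi_node_init(&node);
		pi_sink_init(&my_leaf, &my_leaf_ops);
		pi_add_sink(&node, &my_leaf, 0);	/* node now notifies the leaf */

		pi_source_init(&waiter_src, &waiter_prio);
		pi_boost(&node, &waiter_src, 0);	/* leaf_prio becomes 40 */

		waiter_prio = 35;			/* the referenced value changed */
		pi_boost(&node, &waiter_src, 0);	/* idempotent refresh -> 35 */

		pi_deboost(&node, &waiter_src, 0);	/* leaf falls back to MAX_PRIO */
	}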

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 ++++++
include/linux/pi.h | 293 ++++++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 489 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 843 insertions(+), 1 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concepts of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boosting and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for properly updating a chain of sources and sinks in the proper order. If we passed the priority on the stack, the order in which the system executes could allow the actual value that is set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly destructible (meaning they do not require an explicit “free()” operation when they are no longer used). This is important considering their intended use (spinlock_t's, which are also implicitly destructible). As such, any allocations needed for operation must come from internal structure storage, as there will be no opportunity to free them later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason why we added the ability to add more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add flexible notification mechanisms that are peripheral to the chain, which we informally call “leaf sinks”.
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+              ---------       ---------
+              | leaf  |       | leaf  |
+              ---------       ---------
+                /               /
+     ---------  /    ----------  /    ---------      ---------
+  ->-| node  |-->---| node   |-->---| node  |-->---| leaf  |
+     ---------      ----------      ---------      ---------
+
+The reason why this was done was to unify the notion of a “sink” under a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. "I'm a task", "I'm a lock", "I'm a node", "I'm a work-item", "I'm a wait-queue", etc.).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-sources relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next. But, long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
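
The documentation above describes the node/leaf relationship in the
abstract; as a hedged sketch (hypothetical names, using only the
interfaces declared in the header below), chaining two nodes with a leaf
at the end looks roughly like this. "my_leaf_ops" is assumed to be a leaf
pi_sink_ops providing .boost and .update callbacks.

	static struct pi_node node_a, node_b;
	static struct pi_sink leaf;
	static int waiter_prio = 40;
	static struct pi_source waiter_src;

	static void chain_example(void)
	{
		pi_node_init(&node_a);
		pi_node_init(&node_b);
		pi_sink_init(&leaf, &my_leaf_ops);

		pi_add_sink(&node_b, &leaf, 0);		/* node_b -> leaf */
		pi_add_sink(&node_a, &node_b.sink, 0);	/* node_a -> node_b */

		pi_source_init(&waiter_src, &waiter_prio);
		pi_boost(&node_a, &waiter_src, 0);	/* propagates through node_b to the leaf */
	}
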
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..5535474
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,293 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+
+struct pi_sink;
+
+struct pi_sink_ops {
+ int (*boost)(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *sink,
+ unsigned int flags);
+ int (*free)(struct pi_sink *sink,
+ unsigned int flags);
+};
+
+struct pi_sink {
+ atomic_t refs;
+ struct pi_sink_ops *ops;
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *sink;
+ struct pi_source src;
+ atomic_t refs;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink sink;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head sinks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @sink: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an initial boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *sink,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @sink: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *sink,
+ unsigned int flags);
+
+/**
+ * pi_sink_init - initialize a pi_sink before use
+ * @sink: a sink context
+ * @ops: pointer to a pi_sink_ops structure
+ */
+static inline void
+pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
+{
+ atomic_set(&sink->refs, 0);
+ sink->ops = ops;
+}
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ if (sink->ops->boost)
+ return sink->ops->boost(sink, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ if (sink->ops->deboost)
+ return sink->ops->deboost(sink, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ if (sink->ops->update)
+ return sink->ops->update(sink, flags);
+
+ return 0;
+}
+
+/**
+ * pi_sink_put - down the reference count, freeing the sink if 0
+ * @sink: the sink context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_sink_put(struct pi_sink *sink, unsigned int flags)
+{
+ if (atomic_dec_and_test(&sink->refs)) {
+ if (sink->ops->free)
+ sink->ops->free(sink, flags);
+ }
+}
+
+
+/**
+ * pi_get - up the reference count
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_get(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ atomic_inc(&sink->refs);
+}
+
+/**
+ * pi_put - down the reference count, freeing the node if 0
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_put(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ pi_sink_put(sink, flags);
+}
+
+#endif /* _LINUX_PI_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..d00042c
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,489 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+
+struct updater {
+ int update;
+ struct pi_sinkref *sinkref;
+ struct pi_sink *sink;
+};
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_get(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->sink->refs);
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_put_local(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ spin_unlock(&node->lock);
+ }
+}
+
+static inline void
+_pi_sink_put_all(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *sink = sinkref->sink;
+
+ _pi_sink_put_local(node, sinkref);
+ pi_sink_put(sink, 0);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *sink)
+{
+ return container_of(sink, struct pi_node, sink);
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+}
+
+/*
+ * _pi_node_update - update the chain
+ *
+ * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
+ * that need to propagate up the chain. This is a step-wise process where we
+ * have to be careful about locking and preemption. By trying MAX_PI_DEPs
+ * times, we guarantee that this update routine is an effective barrier...
+ * all modifications made prior to the call to this barrier will have completed.
+ *
+ * Deadlock avoidance: This node may participate in a chain of nodes which
+ * form a graph of arbitrary structure. While the graph should technically
+ * never close on itself barring any bugs, we still want to protect against
+ * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
+ * from detecting this potential). To do this, we employ a dual-locking
+ * scheme where we can carefully control the order. That is: node->lock
+ * protects most of the node's internal state, but it will never be held
+ * across a chain update. sinkref->lock, on the other hand, can be held
+ * across a boost/deboost, and also guarantees proper execution order. Also
+ * note that no locks are held across an sink->update.
+ */
+static int
+_pi_node_update(struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+ struct pi_sinkref *sinkref;
+ unsigned long iflags;
+ int count = 0;
+ int i;
+ int pprio;
+ struct updater updaters[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ pprio = node->prio;
+
+ if (!plist_head_empty(&node->srcs))
+ node->prio = plist_first(&node->srcs)->prio;
+ else
+ node->prio = MAX_PRIO;
+
+ list_for_each_entry(sinkref, &node->sinks, list) {
+ /*
+ * If the priority is changing, or if this is a
+ * BOOST/DEBOOST, we consider this sink "stale"
+ */
+ if (pprio != node->prio
+ || sinkref->state != pi_state_boosted) {
+ struct updater *iter = &updaters[count++];
+
+ BUG_ON(!atomic_read(&sinkref->sink->refs));
+ _pi_sink_get(sinkref);
+
+ iter->update = 1;
+ iter->sinkref = sinkref;
+ iter->sink = sinkref->sink;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ unsigned int lflags = PI_FLAG_DEFER_UPDATE;
+ struct pi_sink *sink;
+
+ sinkref = iter->sinkref;
+ sink = iter->sink;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ sink->ops->boost(sink, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ sink->ops->deboost(sink, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above.
+ */
+ _pi_sink_put_all(node, sinkref);
+ break;
+ case pi_state_free:
+ iter->update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ /*
+ * We will drop the sinkref reference while still holding the
+ * preempt/irqs off so that the memory is returned synchronously
+ * to the system.
+ */
+ _pi_sink_put_local(node, sinkref);
+ }
+
+ local_irq_restore(iflags);
+
+ /*
+ * Note: At this point, sinkref is invalid since we put'd
+ * it above, but sink is valid since we still hold the remote
+ * reference. This is key to the design because it allows us
+ * to synchronously free the sinkref object, yet maintain a
+ * reference to the sink across the update
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ if (iter->update)
+ iter->sink->ops->update(iter->sink, 0);
+ }
+
+ /*
+ * We perform all the free operations together at the end, using
+ * only automatic/stack variables since any one of these operations
+ * could result in our node object being deallocated
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ pi_sink_put(iter->sink, 0);
+ }
+
+ return 0;
+}
+
+static int
+_pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct updater updaters[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ local_irq_save(iflags);
+ spin_lock(&node->lock);
+
+ list_for_each_entry(sinkref, &node->sinks, list) {
+ if (!sink || sink == sinkref->sink) {
+ struct updater *iter = &updaters[count++];
+
+ _pi_sink_get(sinkref);
+ iter->sinkref = sinkref;
+ iter->sink = sinkref->sink;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ int remove = 0;
+
+ sinkref = iter->sinkref;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (sinkref->sink->ops->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above
+ */
+ _pi_sink_put_all(node, sinkref);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ _pi_sink_put_local(node, sinkref);
+ }
+
+ local_irq_restore(iflags);
+
+ for (i = 0; i < count; ++i)
+ pi_sink_put(updaters[i].sink, 0);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->sink, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_boost(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(sink, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(sink, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_free(struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+
+ /*
+ * When the node is freed, we should perform an implicit
+ * del_sink on any remaining sinks we may have.
+ */
+ return _pi_del_sink(node, NULL, flags);
+}
+
+static struct pi_sink_ops pi_node_sink = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+ .free = _pi_node_free,
+};
+
+void
+pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ atomic_set(&node->sink.refs, 1);
+ node->sink.ops = &pi_node_sink;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->sinks);
+ plist_head_init(&node->srcs, &node->lock);
+}
+
+int
+pi_add_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!atomic_read(&node->sink.refs)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ pi_source_init(&sinkref->src, &node->prio);
+ sinkref->sink = sink;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_inc(&sinkref->sink->refs);
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->sinks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->sink, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int
+pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
+{
+ /*
+ * There may be multiple matches to sink because sometimes a
+ * deboost/free may still be pending an update when the same
+ * node has been added. So we want to process any and all
+ * instances that match our target
+ */
+ return _pi_del_sink(node, sink, flags);
+}
+
+
+

2008-08-15 20:31:28

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 2/8] sched: add the basic PI infrastructure to the task_struct

This is a first pass at converting the system to use the new PI library.
We don't go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place. Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.
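
For reference, a subsystem that wants to contribute a temporary priority
to a task would use the new helpers roughly as sketched below. The
"example_booster" wrapper is hypothetical; task_pi_boost(),
task_pi_deboost() and pi_source_init() are the helpers added by this
patch, and the pattern mirrors what the rcupreempt-boost conversion below
does with task->rcu_prio / task->rcu_prio_src.

	#include <linux/sched.h>

	/* Hypothetical wrapper: one priority contribution to one task */
	struct example_booster {
		int prio;		/* 0 = highest, MAX_PRIO = lowest */
		struct pi_source src;
	};

	static void example_booster_init(struct example_booster *b)
	{
		b->prio = MAX_PRIO;
		pi_source_init(&b->src, &b->prio);	/* one-time setup */
	}

	static void example_boost(struct task_struct *p,
				  struct example_booster *b, int prio)
	{
		b->prio = prio;
		task_pi_boost(p, &b->src, 0);	/* initial boost or refresh */
	}

	static void example_unboost(struct task_struct *p,
				    struct example_booster *b)
	{
		task_pi_deboost(p, &b->src, 0);	/* withdraw our contribution */
	}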

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 37 +++++++--
include/linux/workqueue.h | 2
kernel/fork.c | 1
kernel/rcupreempt-boost.c | 23 +-----
kernel/rtmutex.c | 6 +
kernel/sched.c | 188 ++++++++++++++++++++++++++++++++-------------
kernel/workqueue.c | 39 ++++++++-
7 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..5521a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
#include <linux/task_io_accounting.h>
#include <linux/kobject.h>
#include <linux/latencytop.h>
+#include <linux/pi.h>

#include <asm/processor.h>

@@ -1125,6 +1126,7 @@ struct task_struct {
int prio, static_prio, normal_prio;
#ifdef CONFIG_PREEMPT_RCU_BOOST
int rcu_prio;
+ struct pi_source rcu_prio_src;
#endif
const struct sched_class *sched_class;
struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct {
+ struct pi_source src; /* represents normal_prio to 'this' */
+ struct pi_node node;
+ struct pi_sink sink; /* registered to 'this' to get updates */
+ int prio;
+ } pi;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ int rtmutex_prio;
+ struct pi_source rtmutex_prio_src;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
#endif
};

+static inline int
+task_pi_boost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_boost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_deboost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_deboost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_update(struct task_struct *p, unsigned int flags)
+{
+ return pi_update(&p->pi.node, flags);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0)
#else
@@ -1774,14 +1805,8 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-extern void task_setprio(struct task_struct *p, int prio);
-
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
-static inline void rt_mutex_setprio(struct task_struct *p, int prio)
-{
- task_setprio(p, prio);
-}
extern void rt_mutex_adjust_pi(struct task_struct *p);
#else
static inline int rt_mutex_getprio(struct task_struct *p)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 229179e..3dc4ed9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/plist.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -31,6 +32,7 @@ struct work_struct {
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct plist_node entry;
work_func_t func;
+ struct pi_source pi_src;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b49488d..399a0d0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -990,6 +990,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->rcu_flipctr_idx = 0;
#ifdef CONFIG_PREEMPT_RCU_BOOST
p->rcu_prio = MAX_PRIO;
+ pi_source_init(&p->rcu_prio_src, &p->rcu_prio);
p->rcub_rbdp = NULL;
p->rcub_state = RCU_BOOST_IDLE;
INIT_LIST_HEAD(&p->rcub_entry);
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 5282b19..e8d9d76 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -232,14 +232,11 @@ static inline int rcu_is_boosted(struct task_struct *task)
static void rcu_boost_task(struct task_struct *task)
{
WARN_ON(!irqs_disabled());
- WARN_ON_SMP(!spin_is_locked(&task->pi_lock));

rcu_trace_boost_task_boost_called(RCU_BOOST_ME);

- if (task->rcu_prio < task->prio) {
+ if (task_pi_boost(task, &task->rcu_prio_src, 0))
rcu_trace_boost_task_boosted(RCU_BOOST_ME);
- task_setprio(task, task->rcu_prio);
- }
}

/**
@@ -275,26 +272,17 @@ void __rcu_preempt_boost(void)
rbd = &__get_cpu_var(rcu_boost_data);
spin_lock(&rbd->rbs_lock);

- spin_lock(&curr->pi_lock);
-
curr->rcub_rbdp = rbd;

rcu_trace_boost_try_boost(rbd);

- prio = rt_mutex_getprio(curr);
-
if (list_empty(&curr->rcub_entry))
list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost);
- if (prio <= rbd->rbs_prio)
- goto out;
-
- rcu_trace_boost_boosted(curr->rcub_rbdp);

set_rcu_prio(curr, rbd->rbs_prio);
rcu_boost_task(curr);

out:
- spin_unlock(&curr->pi_lock);
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}

@@ -353,15 +341,12 @@ void __rcu_preempt_unboost(void)

rcu_trace_boost_unboosted(rbd);

- set_rcu_prio(curr, MAX_PRIO);
+ task_pi_deboost(curr, &curr->rcu_prio_src, 0);

- spin_lock(&curr->pi_lock);
- prio = rt_mutex_getprio(curr);
- task_setprio(curr, prio);
+ set_rcu_prio(curr, MAX_PRIO);

curr->rcub_rbdp = NULL;

- spin_unlock(&curr->pi_lock);
out:
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}
@@ -393,9 +378,7 @@ static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned lon
list_move_tail(&p->rcub_entry,
&rbd->rbs_boosted);
set_rcu_prio(p, prio);
- spin_lock(&p->pi_lock);
rcu_boost_task(p);
- spin_unlock(&p->pi_lock);

/*
* Now we release the lock to allow for a higher
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 377949a..7d11380 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -178,8 +178,10 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
- rt_mutex_setprio(task, prio);
+ if (task->rtmutex_prio != prio) {
+ task->rtmutex_prio = prio;
+ task_pi_boost(task, &task->rtmutex_prio_src, 0);
+ }
}

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 54ea580..0732a9b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1709,26 +1709,6 @@ static inline int normal_prio(struct task_struct *p)
}

/*
- * Calculate the current priority, i.e. the priority
- * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
- */
-static int effective_prio(struct task_struct *p)
-{
- p->normal_prio = normal_prio(p);
- /*
- * If we are RT tasks or we were boosted to RT priority,
- * keep the priority unchanged. Otherwise, update priority
- * to the normal priority:
- */
- if (!rt_prio(p->prio))
- return p->normal_prio;
- return p->prio;
-}
-
-/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -2375,6 +2355,58 @@ static void __sched_fork(struct task_struct *p)
p->state = TASK_RUNNING;
}

+static int
+task_pi_boost_cb(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags)
+{
+ struct task_struct *p = container_of(sink, struct task_struct, pi.sink);
+
+ /*
+ * We dont need any locking here, since the .boost operation
+ * is already guaranteed to be mutually exclusive
+ */
+ p->pi.prio = *src->prio;
+
+ return 0;
+}
+
+static int task_pi_update_cb(struct pi_sink *sink, unsigned int flags);
+
+static struct pi_sink_ops task_pi_sink = {
+ .boost = task_pi_boost_cb,
+ .update = task_pi_update_cb,
+};
+
+static inline void
+task_pi_init(struct task_struct *p)
+{
+ pi_node_init(&p->pi.node);
+
+ /*
+ * Feed our initial state of normal_prio into the PI infrastructure.
+ * We will update this whenever it changes
+ */
+ p->pi.prio = p->normal_prio;
+ pi_source_init(&p->pi.src, &p->normal_prio);
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+
+#ifdef CONFIG_RT_MUTEXES
+ p->rtmutex_prio = MAX_PRIO;
+ pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
+ task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
+#endif
+
+ /*
+ * We add our own task as a dependency of ourselves so that
+ * we get boost-notifications (via task_pi_boost_cb) whenever
+ * our priority is changed (locally e.g. setscheduler() or
+ * remotely via a pi-boost).
+ */
+ pi_sink_init(&p->pi.sink, &task_pi_sink);
+ pi_add_sink(&p->pi.node, &p->pi.sink,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -2396,6 +2428,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

+ task_pi_init(p);
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -2411,6 +2445,55 @@ void sched_fork(struct task_struct *p, int clone_flags)
}

/*
+ * In the past, task_setprio was exposed as an API. This variant is only
+ * meant to be called from pi_update functions (namely, task_updateprio() and
+ * task_pi_update_cb()). If you need to adjust the priority of a task,
+ * you should be using something like setscheduler() (permanent adjustments)
+ * or task_pi_boost() (temporary adjustments).
+ */
+static void
+task_setprio(struct task_struct *p, int prio)
+{
+ if (prio == p->prio)
+ return;
+
+ if (rt_prio(prio))
+ p->sched_class = &rt_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+
+ p->prio = prio;
+}
+
+static inline void
+task_updateprio(struct task_struct *p)
+{
+ int prio = normal_prio(p);
+
+ if (p->normal_prio != prio) {
+ p->normal_prio = prio;
+ set_load_weight(p);
+
+ /*
+ * Reboost our normal_prio entry, which will
+ * also chain-update any of our PI dependencies (of course)
+ * on our next update
+ */
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+ }
+
+ /*
+ * If normal_prio is logically higher than our current setting,
+ * just assign the priority/class immediately so that any callers
+ * will see the update as synchronous without dropping the rq-lock
+ * to do a pi_update. Any discrepancy with pending pi-updates will
+ * automatically be corrected after we drop the rq-lock.
+ */
+ if (p->normal_prio < p->prio)
+ task_setprio(p, p->normal_prio);
+}
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -2426,7 +2509,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);

- p->prio = effective_prio(p);
+ task_updateprio(p);

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2447,6 +2530,8 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_wake_up(rq, p);
#endif
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -4887,27 +4972,25 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
EXPORT_SYMBOL(sleep_on_timeout);

/*
- * task_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * Invoked whenever our priority changes by the PI library
*
* This function changes the 'effective' priority of a task. It does
* not touch ->normal_prio like __setscheduler().
*
- * Used by the rt_mutex code to implement priority inheritance logic
- * and by rcupreempt-boost to boost priorities of tasks sleeping
- * with rcu locks.
*/
-void task_setprio(struct task_struct *p, int prio)
+static int
+task_pi_update_cb(struct pi_sink *sink, unsigned int flags)
{
- unsigned long flags;
+ struct task_struct *p = container_of(sink, struct task_struct, pi.sink);
+ unsigned long iflags;
int oldprio, on_rq, running;
+ int prio = p->pi.prio;
struct rq *rq;
const struct sched_class *prev_class = p->sched_class;

BUG_ON(prio < 0 || prio > MAX_PRIO);

- rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &iflags);

/*
* Idle task boosting is a nono in general. There is one
@@ -4929,6 +5012,10 @@ void task_setprio(struct task_struct *p, int prio)

update_rq_clock(rq);

+ /* If prio is not changing, bail */
+ if (prio == p->prio)
+ goto out_unlock;
+
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -4937,12 +5024,7 @@ void task_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
- p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
-
- p->prio = prio;
+ task_setprio(p, prio);

// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p));

@@ -4956,7 +5038,9 @@ void task_setprio(struct task_struct *p, int prio)
// trace_special(prev_resched, _need_resched(), 0);

out_unlock:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, &iflags);
+
+ return 0;
}

void set_user_nice(struct task_struct *p, long nice)
@@ -4990,9 +5074,9 @@ void set_user_nice(struct task_struct *p, long nice)
}

p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p);
old_prio = p->prio;
- p->prio = effective_prio(p);
+ task_updateprio(p);
+
delta = p->prio - old_prio;

if (on_rq) {
@@ -5007,6 +5091,8 @@ void set_user_nice(struct task_struct *p, long nice)
}
out_unlock:
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}
EXPORT_SYMBOL(set_user_nice);

@@ -5123,23 +5209,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
BUG_ON(p->se.on_rq);

p->policy = policy;
- switch (p->policy) {
- case SCHED_NORMAL:
- case SCHED_BATCH:
- case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
- break;
- case SCHED_FIFO:
- case SCHED_RR:
- p->sched_class = &rt_sched_class;
- break;
- }
-
p->rt_priority = prio;
- p->normal_prio = normal_prio(p);
- /* we are holding p->pi_lock already */
- p->prio = rt_mutex_getprio(p);
- set_load_weight(p);
+
+ task_updateprio(p);
}

/**
@@ -5264,6 +5336,7 @@ recheck:
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);

+ task_pi_update(p, 0);
rt_mutex_adjust_pi(p);

return 0;
@@ -6686,6 +6759,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
deactivate_task(rq, rq->idle, 0);
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
+ rq->idle->prio = rq->idle->normal_prio;
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
@@ -8395,6 +8469,8 @@ void __init sched_init(void)
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
#endif

+ task_pi_init(&init_task);
+
#ifdef CONFIG_RT_MUTEXES
plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
#endif
@@ -8460,7 +8536,9 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
+
__setscheduler(rq, p, SCHED_NORMAL, 0);
+
if (on_rq) {
activate_task(rq, p, 0);
resched_task(rq->curr);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f37979..5cd4b0e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -145,8 +145,13 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
plist_node_init(&work->entry, prio);
plist_add(&work->entry, &cwq->worklist);

- if (boost_prio < cwq->thread->prio)
- task_setprio(cwq->thread, boost_prio);
+ /*
+ * FIXME: We want to boost to boost_prio, but we dont record that
+ * value in the work_struct for later deboosting
+ */
+ pi_source_init(&work->pi_src, &work->entry.prio);
+ task_pi_boost(cwq->thread, &work->pi_src, 0);
+
wake_up(&cwq->more_work);
}

@@ -280,6 +285,10 @@ struct wq_barrier {
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
struct plist_head *worklist = &cwq->worklist;
+ struct pi_source pi_src;
+ int prio;
+
+ pi_source_init(&pi_src, &prio);

spin_lock_irq(&cwq->lock);
cwq->run_depth++;
@@ -292,10 +301,10 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)

again:
while (!plist_head_empty(worklist)) {
- int prio;
struct work_struct *work = plist_first_entry(worklist,
struct work_struct, entry);
work_func_t f = work->func;
+
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct
@@ -316,14 +325,28 @@ again:
}
prio = max(prio, 0);

- if (likely(cwq->thread->prio != prio))
- task_setprio(cwq->thread, prio);
-
cwq->current_work = work;
plist_del(&work->entry, worklist);
plist_node_init(&work->entry, MAX_PRIO);
spin_unlock_irq(&cwq->lock);

+ /*
+ * The owner is free to reuse the work object once we execute
+ * the work->func() below. Therefore we cannot leave the
+ * work->pi_src boosting our thread or it may get stomped
+ * on when the work item is requeued.
+ *
+ * So what we do is we boost ourselves with an on-the
+ * stack copy of the priority of the work item, and then
+ * deboost the workitem. Once the work is complete, we
+ * can then simply deboost the stack version.
+ *
+ * Note that this will not typically cause a pi-chain
+ * update since we are boosting the node laterally
+ */
+ task_pi_boost(current, &pi_src, PI_FLAG_DEFER_UPDATE);
+ task_pi_deboost(current, &work->pi_src, PI_FLAG_DEFER_UPDATE);
+
BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
leak_check(NULL);
@@ -334,6 +357,9 @@ again:
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);
leak_check(f);

+ /* Deboost the stack copy of the work->prio (see above) */
+ task_pi_deboost(current, &pi_src, 0);
+
spin_lock_irq(&cwq->lock);
cwq->current_work = NULL;
wake_up_all(&cwq->work_done);
@@ -357,7 +383,6 @@ again:
goto again;
}

- task_setprio(cwq->thread, current->normal_prio);
cwq->run_depth--;
spin_unlock_irq(&cwq->lock);
}

2008-08-15 20:32:20

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 5/8] RT: wrap the rt_rwlock "add reader" logic

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
# define rw_check_held(rwm) do { } while (0)
#endif

+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+ list_add(&rls->list, &rwm->readers);
+}
+
/*
* The fast path does not add itself to the reader list to keep
* from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
if (rls->list.prev && !list_empty(&rls->list))
return;

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);

/* change to reader, so no one else updates too */
rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
* it hasn't been added to the link list yet.
*/
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
rls->count++;
incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rls->lock = rwm;
rls->count = 1;
WARN_ON(rls->list.prev && !list_empty(&rls->list));
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
} else
spin_unlock(&current->pi_lock);
@@ -2083,7 +2089,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)

/* Set us up for multiple readers or conflicts */

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rwm->owner = RT_RW_READER;

/*

2008-08-15 20:31:58

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 4/8] rtmutex: formally initialize the rt_mutex_waiters

We will be adding more logic to rt_mutex_waiters, and therefore let's
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
}
#endif

+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+ memset(waiter, 0, sizeof(*waiter));
+
+ debug_rt_mutex_init_waiter(waiter);
+ waiter->task = NULL;
+ waiter->write_lock = 0;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct task_struct *orig_owner;
int missed = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long saved_state = -1, state, flags;

+ init_waiter(&waiter);
+
spin_lock_irqsave(&mutex->wait_lock, flags);
init_rw_lists(rwm);

@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)

/* Owner is a writer (or a blocked writer). Block on the lock */

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
-
if (mtx) {
/*
* We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long flags, saved_state = -1, state;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
+ init_waiter(&waiter);

/* we do PI different for writers that are blocked */
waiter.write_lock = 1;
@@ -2270,9 +2274,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
unsigned long flags;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

2008-08-15 20:31:42

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 3/8] sched: rework task reference counting to work with the pi infrastructure

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 7 +++++--
kernel/fork.c | 32 +++++++++++++++-----------------
kernel/sched.c | 21 +++++++++++++++++++++
3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5521a64..7ae8eca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct task_struct {
struct pi_node node;
struct pi_sink sink; /* registered to 'this' to get updates */
int prio;
+ struct rcu_head rcu; /* for destruction cleanup */
+
} pi;

#ifdef CONFIG_RT_MUTEXES
@@ -1633,12 +1635,11 @@ static inline void put_task_struct(struct task_struct *t)
call_rcu(&t->rcu, __put_task_struct_cb);
}
#else
-extern void __put_task_struct(struct task_struct *t);

static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ pi_put(&t->pi.node, 0);
}
#endif

@@ -2469,6 +2470,8 @@ static inline int task_is_current(struct task_struct *task)
}
#endif

+extern void prepare_free_task(struct task_struct *tsk);
+
#define TASK_STATE_TO_CHAR_STR "RMSDTtZX"

#endif /* __KERNEL__ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..4d4fba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -130,39 +130,37 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);

-#ifdef CONFIG_PREEMPT_RT
-void __put_task_struct_cb(struct rcu_head *rhp)
+void prepare_free_task(struct task_struct *tsk)
{
- struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
-
BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(!tsk->exit_state);
WARN_ON(tsk == current);

+#ifdef CONFIG_PREEMPT_RT
+ WARN_ON(!tsk->exit_state);
+#else
+ WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
+#endif
+
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);
+
+#ifdef CONFIG_PREEMPT_RT
delayacct_tsk_free(tsk);
+#endif

if (!profile_handoff_task(tsk))
free_task(tsk);
}

-#else
-
-void __put_task_struct(struct task_struct *tsk)
+#ifdef CONFIG_PREEMPT_RT
+void __put_task_struct_cb(struct rcu_head *rhp)
{
- WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
- BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(tsk == current);
-
- security_task_free(tsk);
- free_uid(tsk->user);
- put_group_info(tsk->group_info);
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ pi_put(&tsk->pi.node, 0);
}
+
#endif

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 0732a9b..eb14b9f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2370,11 +2370,32 @@ task_pi_boost_cb(struct pi_sink *sink, struct pi_source *src,
return 0;
}

+static void task_pi_free_rcu(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, pi.rcu);
+
+ prepare_free_task(tsk);
+}
+
+/*
+ * This function is invoked whenever the last references to a task have
+ * been dropped, and we should free the memory on the next rcu grace period
+ */
+static int task_pi_free_cb(struct pi_sink *sink, unsigned int flags)
+{
+ struct task_struct *p = container_of(sink, struct task_struct, pi.sink);
+
+ call_rcu(&p->pi.rcu, task_pi_free_rcu);
+
+ return 0;
+}
+
static int task_pi_update_cb(struct pi_sink *sink, unsigned int flags);

static struct pi_sink_ops task_pi_sink = {
.boost = task_pi_boost_cb,
.update = task_pi_update_cb,
+ .free = task_pi_free_cb,
};

static inline void

2008-08-15 20:32:39

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 6/8] rtmutex: use runtime init for rtmutexes

The system already has facilities to perform late/run-time init for
rtmutexes. We want to add more advanced initialization later in the
series, so we force all rtmutexes through the init path in preparation
for the later patches.
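
For reference, a statically defined lock now relies entirely on that
runtime path. A minimal sketch of what this means for a user
("example_lock" and example() are made-up names, not from this patch):

	/* the static definition no longer pre-builds wait_list/owner */
	static DEFINE_RT_MUTEX(example_lock);

	void example(void)
	{
		/*
		 * init_lists() in kernel/rtmutex.c performs the late
		 * initialization the first time the lock hits the
		 * slowpath; explicit users can also run
		 * rt_mutex_init()/__rt_mutex_init() up front.
		 */
		rt_mutex_lock(&example_lock);
		rt_mutex_unlock(&example_lock);
	}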

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rtmutex.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
- , .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

#define DEFINE_RT_MUTEX(mutexname) \

2008-08-15 20:32:56

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 7/8] rtmutex: convert rtmutexes to fully use the PI library

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code. This patch converts the rtmutex PI code to officially
use the PI library.
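
The core of the conversion is visible in the kernel/rtmutex.c hunks
below: each rt_mutex now owns a pi_node plus a source/sink pair, and
boosts/deboosts the current owner through task_pi_boost() and
task_pi_deboost(). The registration pattern, lifted from the patch
with comments added:

	static struct pi_sink_ops rtmutex_pi_sink = {
		.boost  = rtmutex_pi_boost,	/* record new top-waiter prio */
		.update = rtmutex_pi_update,	/* propagate it to the owner */
	};

	static void init_pi(struct rt_mutex *lock)
	{
		pi_node_init(&lock->pi.node);

		lock->pi.prio = MAX_PRIO;	/* unboosted until waiters arrive */
		pi_source_init(&lock->pi.src, &lock->pi.prio);
		pi_sink_init(&lock->pi.sink, &rtmutex_pi_sink);

		/* the lock listens on its own node for priority changes */
		pi_add_sink(&lock->pi.node, &lock->pi.sink,
			    PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
	}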

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 15 -
include/linux/sched.h | 21 -
kernel/fork.c | 2
kernel/rcupreempt-boost.c | 2
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 944 ++++++++++++++-------------------------------
kernel/rtmutex_common.h | 18 -
kernel/rwlock_torture.c | 32 --
kernel/sched.c | 12 -
11 files changed, 321 insertions(+), 735 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..c5da71d 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
#include <asm/atomic.h>
#include <linux/spinlock_types.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>

#ifdef CONFIG_PREEMPT_RT
/*
@@ -67,6 +68,7 @@ struct rw_mutex {
atomic_t count; /* number of times held for read */
atomic_t owners; /* number of owners as readers */
struct list_head readers;
+ struct pi_sink pi_sink;
int prio;
};

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..e069182 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/plist.h>
#include <linux/spinlock_types.h>
+#include <linux/pi.h>

/**
* The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
raw_spinlock_t wait_lock;
struct plist_head wait_list;
struct task_struct *owner;
+ struct {
+ struct pi_source src;
+ struct pi_node node;
+ struct pi_sink sink;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7ae8eca..6462846 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1106,6 +1106,7 @@ struct reader_lock_struct {
struct rw_mutex *lock;
struct list_head list;
struct task_struct *task;
+ struct pi_source pi_src;
int count;
};

@@ -1309,15 +1310,6 @@ struct task_struct {

} pi;

-#ifdef CONFIG_RT_MUTEXES
- /* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
- /* Deadlock detection and priority inheritance handling */
- struct rt_mutex_waiter *pi_blocked_on;
- int rtmutex_prio;
- struct pi_source rtmutex_prio_src;
-#endif
-
#ifdef CONFIG_DEBUG_MUTEXES
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
@@ -1806,17 +1798,6 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-#ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_adjust_pi(struct task_struct *p);
-#else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
- return p->normal_prio;
-}
-# define rt_mutex_adjust_pi(p) do { } while (0)
-#endif
-
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d4fba3..79ba6fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -885,8 +885,6 @@ static void rt_mutex_init_task(struct task_struct *p)
{
spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters, &p->pi_lock);
- p->pi_blocked_on = NULL;
# ifdef CONFIG_DEBUG_RT_MUTEXES
p->last_kernel_lock = NULL;
# endif
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index e8d9d76..85b3c2b 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -424,7 +424,7 @@ void rcu_boost_readers(void)

spin_lock_irqsave(&rcu_boost_wake_lock, flags);

- prio = rt_mutex_getprio(curr);
+ prio = get_rcu_prio(curr);

rcu_trace_boost_try_boost_readers(RCU_BOOST_ME);

diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 0d9cb54..2034ce1 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -57,8 +57,6 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
- DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
#ifdef CONFIG_PREEMPT_RT
WARN_ON(task->reader_lock_count);
#endif
@@ -156,7 +154,6 @@ void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
waiter->deadlock_task_pid = NULL;
}

@@ -164,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
DEBUG_LOCKS_WARN_ON(waiter->task);
memset(waiter, 0x22, sizeof(*waiter));
}
diff --git a/kernel/rtmutex-tester.c b/kernel/rtmutex-tester.c
index 092e4c6..dff8781 100644
--- a/kernel/rtmutex-tester.c
+++ b/kernel/rtmutex-tester.c
@@ -373,11 +373,11 @@ static ssize_t sysfs_test_status(struct sys_device *dev, char *buf)
spin_lock(&rttest_lock);

curr += sprintf(curr,
- "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, B: %p, K: %d, M:",
+ "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, K: %d M:",
td->opcode, td->event, tsk->state,
(MAX_RT_PRIO - 1) - tsk->prio,
(MAX_RT_PRIO - 1) - tsk->normal_prio,
- tsk->pi_blocked_on, td->bkl);
+ td->bkl);

for (i = MAX_RT_TEST_MUTEXES - 1; i >=0 ; i--)
curr += sprintf(curr, "%d", td->mutexes[i]);
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 62fdc3d..8acbf23 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -58,14 +58,32 @@
* state.
*/

+static inline void
+rtmutex_pi_owner(struct rt_mutex *lock, struct task_struct *p, int add)
+{
+ if (!p || p == RT_RW_READER)
+ return;
+
+ if (add)
+ task_pi_boost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+ else
+ task_pi_deboost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+}
+
static void
rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
unsigned long mask)
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock))
+ if (rt_mutex_has_waiters(lock)) {
+ struct task_struct *prev_owner = rt_mutex_owner(lock);
+
+ rtmutex_pi_owner(lock, prev_owner, 0);
+ rtmutex_pi_owner(lock, owner, 1);
+
val |= RT_MUTEX_HAS_WAITERS;
+ }

lock->owner = (struct task_struct *)val;
}
@@ -134,245 +152,88 @@ static inline int task_is_reader(struct task_struct *task) { return 0; }
#endif

int pi_initialized;
-
-/*
- * we initialize the wait_list runtime. (Could be done build-time and/or
- * boot-time.)
- */
-static inline void init_lists(struct rt_mutex *lock)
+static inline int rtmutex_pi_boost(struct pi_sink *sink,
+ struct pi_source *src,
+ unsigned int flags)
{
- if (unlikely(!lock->wait_list.prio_list.prev)) {
- plist_head_init(&lock->wait_list, &lock->wait_lock);
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- pi_initialized++;
-#endif
- }
-}
-
-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio);
-
-/*
- * Calculate task priority from the waiter list priority
- *
- * Return task->normal_prio when the waiter list is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
- int prio = min(task->normal_prio, get_rcu_prio(task));
-
- prio = rt_mutex_get_readers_prio(task, prio);
-
- if (likely(!task_has_pi_waiters(task)))
- return prio;
-
- return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
-}
+ struct rt_mutex *lock = container_of(sink, struct rt_mutex, pi.sink);

-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
- int prio = rt_mutex_getprio(task);
-
- if (task->rtmutex_prio != prio) {
- task->rtmutex_prio = prio;
- task_pi_boost(task, &task->rtmutex_prio_src, 0);
- }
-}
-
-/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-static void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ lock->pi.prio = *src->prio;

- spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
+ return 0;
}

-/*
- * Max number of times we'll walk the boosting chain:
- */
-int max_lock_depth = 1024;
-
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth);
-/*
- * Adjust the priority chain. Also used for deadlock detection.
- * Decreases task's usage by one - may thus free the task.
- * Returns 0 or -EDEADLK.
- */
-static int rt_mutex_adjust_prio_chain(struct task_struct *task,
- int deadlock_detect,
- struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- int recursion_depth)
+static inline int rtmutex_pi_update(struct pi_sink *sink,
+ unsigned int flags)
{
- struct rt_mutex *lock;
- struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
- int detect_deadlock, ret = 0, depth = 0;
- unsigned long flags;
+ struct rt_mutex *lock = container_of(sink, struct rt_mutex, pi.sink);
+ struct task_struct *owner = NULL;
+ unsigned long iflags;

- detect_deadlock = debug_rt_mutex_detect_deadlock(orig_waiter,
- deadlock_detect);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- /*
- * The (de)boosting is a step by step approach with a lot of
- * pitfalls. We want this to be preemptible and we want hold a
- * maximum of two locks per step. So we have to check
- * carefully whether things change under us.
- */
- again:
- if (++depth > max_lock_depth) {
- static int prev_max;
+ if (rt_mutex_has_waiters(lock)) {
+ owner = rt_mutex_owner(lock);

- /*
- * Print this only once. If the admin changes the limit,
- * print a new message when reaching the limit again.
- */
- if (prev_max != max_lock_depth) {
- prev_max = max_lock_depth;
- printk(KERN_WARNING "Maximum lock depth %d reached "
- "task: %s (%d)\n", max_lock_depth,
- top_task->comm, task_pid_nr(top_task));
+ if (owner && owner != RT_RW_READER) {
+ rtmutex_pi_owner(lock, owner, 1);
+ get_task_struct(owner);
}
- put_task_struct(task);
-
- return deadlock_detect ? -EDEADLK : 0;
}
- retry:
- /*
- * Task can not go away as we did a get_task() before !
- */
- spin_lock_irqsave(&task->pi_lock, flags);

- waiter = task->pi_blocked_on;
- /*
- * Check whether the end of the boosting chain has been
- * reached or the state of the chain has changed while we
- * dropped the locks.
- */
- if (!waiter || !waiter->task)
- goto out_unlock_pi;
-
- /*
- * Check the orig_waiter state. After we dropped the locks,
- * the previous owner of the lock might have released the lock
- * and made us the pending owner:
- */
- if (orig_waiter && !orig_waiter->task)
- goto out_unlock_pi;
-
- /*
- * Drop out, when the task has no waiters. Note,
- * top_waiter can be NULL, when we are in the deboosting
- * mode!
- */
- if (top_waiter && (!task_has_pi_waiters(task) ||
- top_waiter != task_top_pi_waiter(task)))
- goto out_unlock_pi;
-
- /*
- * When deadlock detection is off then we check, if further
- * priority adjustment is necessary.
- */
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
- goto out_unlock_pi;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);

- lock = waiter->lock;
- if (!spin_trylock(&lock->wait_lock)) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- cpu_relax();
- goto retry;
+ if (owner && owner != RT_RW_READER) {
+ task_pi_update(owner, 0);
+ put_task_struct(owner);
}

- /* Deadlock detection */
- if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- debug_rt_mutex_deadlock(deadlock_detect, orig_waiter, lock);
- spin_unlock(&lock->wait_lock);
- ret = deadlock_detect ? -EDEADLK : 0;
- goto out_unlock_pi;
- }
+ return 0;
+}

- top_waiter = rt_mutex_top_waiter(lock);
+static struct pi_sink_ops rtmutex_pi_sink = {
+ .boost = rtmutex_pi_boost,
+ .update = rtmutex_pi_update,
+};

- /* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
-
- /* Release the task */
- spin_unlock(&task->pi_lock);
- put_task_struct(task);
+static void init_pi(struct rt_mutex *lock)
+{
+ pi_node_init(&lock->pi.node);

- /* Grab the next task */
- task = rt_mutex_owner(lock);
+ lock->pi.prio = MAX_PRIO;
+ pi_source_init(&lock->pi.src, &lock->pi.prio);
+ pi_sink_init(&lock->pi.sink, &rtmutex_pi_sink);

- /*
- * Readers are special. We may need to boost more than one owner.
- */
- if (task_is_reader(task)) {
- ret = rt_mutex_adjust_readers(orig_lock, orig_waiter,
- top_task, lock,
- recursion_depth);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- goto out;
- }
+ pi_add_sink(&lock->pi.node, &lock->pi.sink,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}

- get_task_struct(task);
- spin_lock(&task->pi_lock);
-
- if (waiter == rt_mutex_top_waiter(lock)) {
- /* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
-
- } else if (top_waiter == waiter) {
- /* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
- waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
+/*
+ * we initialize the wait_list runtime. (Could be done build-time and/or
+ * boot-time.)
+ */
+static inline void init_lists(struct rt_mutex *lock)
+{
+ if (unlikely(!lock->wait_list.prio_list.prev)) {
+ plist_head_init(&lock->wait_list, &lock->wait_lock);
+ init_pi(lock);
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ pi_initialized++;
+#endif
}
-
- spin_unlock(&task->pi_lock);
-
- top_waiter = rt_mutex_top_waiter(lock);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- if (!detect_deadlock && waiter != top_waiter)
- goto out_put_task;
-
- goto again;
-
- out_unlock_pi:
- spin_unlock_irqrestore(&task->pi_lock, flags);
- out_put_task:
- put_task_struct(task);
- out:
- return ret;
}

/*
+ * Max number of times we'll walk the boosting chain:
+ */
+int max_lock_depth = 1024;
+
+/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
@@ -380,7 +241,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *next;

if (!rt_mutex_owner_pending(lock))
return 0;
@@ -390,49 +250,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)

WARN_ON(task_is_reader(rt_mutex_owner(lock)));

- spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, mode)) {
- spin_unlock(&pendowner->pi_lock);
- return 0;
- }
-
- /*
- * Check if a waiter is enqueued on the pending owners
- * pi_waiters list. Remove it and readjust pending owners
- * priority.
- */
- if (likely(!rt_mutex_has_waiters(lock))) {
- spin_unlock(&pendowner->pi_lock);
- return 1;
- }
-
- /* No chain handling, pending owner is not blocked on anything: */
- next = rt_mutex_top_waiter(lock);
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
- __rt_mutex_adjust_prio(pendowner);
- spin_unlock(&pendowner->pi_lock);
-
- /*
- * We are going to steal the lock and a waiter was
- * enqueued on the pending owners pi_waiters queue. So
- * we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
- * lock and gets unboosted because the booster is
- * interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
- *
- * Note: in the rare case of a SCHED_OTHER task changing
- * its priority and thus stealing the lock, next->task
- * might be current:
- */
- if (likely(next->task != current)) {
- spin_lock(&current->pi_lock);
- plist_add(&next->pi_list_entry, &current->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock(&current->pi_lock);
- }
- return 1;
+ return lock_is_stealable(pendowner, mode);
}

/*
@@ -486,74 +304,145 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
}

/*
- * Task blocks on lock.
- *
- * Prepare waiter and propagate pi chain
- *
- * This must be called with lock->wait_lock held.
+ * These callbacks are invoked whenever a waiter has changed priority.
+ * So we should requeue it within the lock->wait_list
*/
-static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- int detect_deadlock, unsigned long flags)
+
+static inline int rtmutex_waiter_pi_boost(struct pi_sink *sink,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct task_struct *owner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *top_waiter = waiter;
- int chain_walk = 0, res;
+ struct rt_mutex_waiter *waiter;

- spin_lock(&current->pi_lock);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
- waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ waiter = container_of(sink, struct rt_mutex_waiter, pi.sink);

- /* Get the top priority waiter on the lock */
- if (rt_mutex_has_waiters(lock))
- top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ /*
+ * We don't need to take any locks here because the
+ * waiter->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ waiter->pi.prio = *src->prio;

- current->pi_blocked_on = waiter;
+ return 0;
+}

- spin_unlock(&current->pi_lock);
+static inline int rtmutex_waiter_pi_update(struct pi_sink *sink,
+ unsigned int flags)
+{
+ struct rt_mutex *lock;
+ struct rt_mutex_waiter *waiter;
+ unsigned long iflags;

- if (waiter == rt_mutex_top_waiter(lock)) {
- /* readers are handled differently */
- if (task_is_reader(owner)) {
- res = rt_mutex_adjust_readers(lock, waiter,
- current, lock, 0);
- return res;
- }
+ waiter = container_of(sink, struct rt_mutex_waiter, pi.sink);
+ lock = waiter->lock;

- spin_lock(&owner->pi_lock);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- __rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
- chain_walk = 1;
- spin_unlock(&owner->pi_lock);
+ /*
+ * If waiter->task is non-NULL, it means we are still valid in the
+ * pi list. Therefore, if waiter->pi.prio has changed since we
+ * queued ourselves, requeue it.
+ */
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
}
- else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
- chain_walk = 1;

- if (!chain_walk || task_is_reader(owner))
- return 0;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);
+
+ return 0;
+}
+
+static struct pi_sink_ops rtmutex_waiter_pi_sink = {
+ .boost = rtmutex_waiter_pi_boost,
+ .update = rtmutex_waiter_pi_update,
+};
+
+/*
+ * This must be called with lock->wait_lock held.
+ */
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ int has_waiters = rt_mutex_has_waiters(lock);
+
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+ pi_sink_init(&waiter->pi.sink, &rtmutex_waiter_pi_sink);

/*
- * The owner can't disappear while holding a lock,
- * so the owner struct is protected by wait_lock.
- * Gets dropped in rt_mutex_adjust_prio_chain()!
+ * Link the waiter object to the task so that we can adjust our
+ * position on the prio list if the priority is changed. Note
+ * that if the priority races between the time we recorded it
+ * above and the time it is set here, we will correct the race
+ * when we task_pi_update(current) below. Otherwise the
+ * update is a no-op.
*/
- get_task_struct(owner);
+ pi_add_sink(&current->pi.node, &waiter->pi.sink,
+ PI_FLAG_DEFER_UPDATE);

- spin_unlock_irqrestore(&lock->wait_lock, flags);
+ /*
+ * Link the lock object to the waiter so that we can form a chain
+ * to the owner
+ */
+ pi_add_sink(&current->pi.node, &lock->pi.node.sink,
+ PI_FLAG_DEFER_UPDATE);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current, 0);
+ /*
+ * If we previously had no waiters, we are transitioning to
+ * a mode where we need to boost the owner
+ */
+ if (!has_waiters) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 1);
+ }

- spin_lock_irq(&lock->wait_lock);
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ /*
+ * We can stop boosting the owner if there are no more waiters
+ */
+ if (!rt_mutex_has_waiters(lock)) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 0);
+ }

- return res;
+ /*
+ * Unlink the lock object from the waiter
+ */
+ pi_del_sink(&p->pi.node, &lock->pi.node.sink, PI_FLAG_DEFER_UPDATE);
+
+ /*
+ * Unlink the waiter object from the task. Note that we
+ * technically do not need an update for "p" because the
+ * .deboost will be processed synchronous to this call
+ * since there is no .deboost handler registered for
+ * the waiter sink
+ */
+ pi_del_sink(&p->pi.node, &waiter->pi.sink, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -566,24 +455,10 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
- struct rt_mutex_waiter *waiter;
- struct task_struct *pendowner;
- struct rt_mutex_waiter *next;
-
- spin_lock(&current->pi_lock);
+ struct rt_mutex_waiter *waiter = rt_mutex_top_waiter(lock);
+ struct task_struct *pendowner = waiter->task;

- waiter = rt_mutex_top_waiter(lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
-
- /*
- * Remove it from current->pi_waiters. We do not adjust a
- * possible priority boost right now. We execute wakeup in the
- * boosted mode and go back to normal after releasing
- * lock->wait_lock.
- */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- pendowner = waiter->task;
- waiter->task = NULL;
+ remove_waiter(lock, waiter);

/*
* Do the wakeup before the ownership change to give any spinning
@@ -621,113 +496,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
}

rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
- spin_unlock(&current->pi_lock);
-
- /*
- * Clear the pi_blocked_on variable and enqueue a possible
- * waiter into the pi_waiters list of the pending owner. This
- * prevents that in case the pending owner gets unboosted a
- * waiter with higher priority than pending-owner->normal_prio
- * is blocked on the unboosted (pending) owner.
- */
-
- if (rt_mutex_has_waiters(lock))
- next = rt_mutex_top_waiter(lock);
- else
- next = NULL;
-
- spin_lock(&pendowner->pi_lock);
-
- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
- pendowner->pi_blocked_on = NULL;
-
- if (next)
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
- spin_unlock(&pendowner->pi_lock);
-}
-
-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long flags)
-{
- int first = (waiter == rt_mutex_top_waiter(lock));
- struct task_struct *owner = rt_mutex_owner(lock);
- int chain_walk = 0;
-
- spin_lock(&current->pi_lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
- current->pi_blocked_on = NULL;
- spin_unlock(&current->pi_lock);
-
- if (first && owner != current && !task_is_reader(owner)) {
-
- spin_lock(&owner->pi_lock);
-
- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
-
- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
- }
- __rt_mutex_adjust_prio(owner);
-
- if (owner->pi_blocked_on)
- chain_walk = 1;
-
- spin_unlock(&owner->pi_lock);
- }
-
- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
- if (!chain_walk)
- return;
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(owner);
-
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0);
-
- spin_lock_irq(&lock->wait_lock);
-}
-
-/*
- * Recheck the pi chain, in case we got a priority setting
- *
- * Called from sched_setscheduler
- */
-void rt_mutex_adjust_pi(struct task_struct *task)
-{
- struct rt_mutex_waiter *waiter;
- unsigned long flags;
-
- spin_lock_irqsave(&task->pi_lock, flags);
-
- waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- return;
- }
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
-
- rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0);
}

/*
@@ -869,7 +637,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* but the lock got stolen by an higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(lock, &waiter, 0, flags);
+ add_waiter(lock, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -917,7 +685,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* can end up with a non-NULL waiter.task:
*/
if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);
/*
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up:
@@ -927,6 +695,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unlock:
spin_unlock_irqrestore(&lock->wait_lock, flags);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -954,8 +725,8 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
@@ -1126,6 +897,9 @@ static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
list_add(&rls->list, &rwm->readers);
+
+ pi_source_init(&rls->pi_src, &rwm->prio);
+ task_pi_boost(rls->task, &rls->pi_src, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -1249,21 +1023,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
waiter = rt_mutex_top_waiter(mutex);
if (!lock_is_stealable(waiter->task, mode))
return 0;
- /*
- * The pending reader has PI waiters,
- * but we are taking the lock.
- * Remove the waiters from the pending owner.
- */
- spin_lock(&mtxowner->pi_lock);
- plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters);
- spin_unlock(&mtxowner->pi_lock);
}
- } else if (rt_mutex_has_waiters(mutex)) {
- /* Readers do things differently with respect to PI */
- waiter = rt_mutex_top_waiter(mutex);
- spin_lock(&current->pi_lock);
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
}
/* Readers never own the mutex */
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
@@ -1275,7 +1035,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
if (incr) {
atomic_inc(&rwm->owners);
rw_check_held(rwm);
- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
rls = &current->owned_read_locks[reader_count];
@@ -1285,10 +1045,11 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();
}
rt_mutex_deadlock_account_lock(mutex, current);
atomic_inc(&rwm->count);
+
return 1;
}

@@ -1378,7 +1139,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1417,7 +1178,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

WARN_ON(rt_mutex_owner(mutex) &&
rt_mutex_owner(mutex) != current &&
@@ -1430,6 +1191,9 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -1457,13 +1221,13 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);
local_irq_save(flags);
- spin_lock(&current->pi_lock);
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
current->owned_read_locks[reader_count].lock = rwm;
current->owned_read_locks[reader_count].count = 1;
} else
WARN_ON_ONCE(1);
+
/*
* If this task is no longer the sole owner of the lock
* or someone is blocking, then we need to add the task
@@ -1473,16 +1237,12 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
struct rt_mutex *mutex = &rwm->mutex;
struct reader_lock_struct *rls;

- /* preserve lock order, we only need wait_lock now */
- spin_unlock(&current->pi_lock);
-
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- rt_rwlock_add_reader(rlw, rwm);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
- } else
- spin_unlock(&current->pi_lock);
+ }
local_irq_restore(flags);
return 1;
}
@@ -1591,7 +1351,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1630,7 +1390,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

/* check on unlock if we have any waiters. */
if (rt_mutex_has_waiters(mutex))
@@ -1642,6 +1402,9 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);

}
@@ -1733,7 +1496,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

for (i = current->reader_lock_count - 1; i >= 0; i--) {
if (current->owned_read_locks[i].lock == rwm) {
- spin_lock(&current->pi_lock);
+ preempt_disable();
current->owned_read_locks[i].count--;
if (!current->owned_read_locks[i].count) {
current->reader_lock_count--;
@@ -1743,9 +1506,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
WARN_ON(!rls->list.prev || list_empty(&rls->list));
list_del_init(&rls->list);
rls->lock = NULL;
+ task_pi_deboost(current, &rls->pi_src,
+ PI_FLAG_DEFER_UPDATE);
rw_check_held(rwm);
}
- spin_unlock(&current->pi_lock);
+ preempt_enable();
break;
}
}
@@ -1776,7 +1541,6 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

/* If no one is blocked, then clear all ownership */
if (!rt_mutex_has_waiters(mutex)) {
- rwm->prio = MAX_PRIO;
/*
* If count is not zero, we are under the limit with
* no other readers.
@@ -1835,28 +1599,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
}

- if (rt_mutex_has_waiters(mutex)) {
- waiter = rt_mutex_top_waiter(mutex);
- rwm->prio = waiter->task->prio;
- /*
- * If readers still own this lock, then we need
- * to update the pi_list too. Readers have a separate
- * path in the PI chain.
- */
- if (reader_count) {
- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->pi_list_entry,
- &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
- }
- } else
- rwm->prio = MAX_PRIO;
-
out:
spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -1874,9 +1621,9 @@ rt_read_fastunlock(struct rw_mutex *rwm,
int reader_count;
int owners;

- spin_lock_irqsave(&current->pi_lock, flags);
+ local_irq_save(flags);
reader_count = --current->reader_lock_count;
- spin_unlock_irqrestore(&current->pi_lock, flags);
+ local_irq_restore(flags);

rt_mutex_deadlock_account_unlock(current);
if (unlikely(reader_count < 0)) {
@@ -1972,17 +1719,7 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- spin_lock(&reader->pi_lock);
- waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);
+ remove_waiter(mutex, waiter);

if (savestate)
wake_up_process_mutex(reader);
@@ -1995,32 +1732,12 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- spin_lock(&pendowner->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-
- /* This could also be a reader (if reader_limit is set) */
- if (next->write_lock)
- /* add back in as top waiter */
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- rwm->prio = next->task->prio;
- } else
- rwm->prio = MAX_PRIO;
-
out:

spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -2068,7 +1785,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);

- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
rls = &current->owned_read_locks[reader_count];
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
@@ -2076,12 +1793,11 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
rls->count = 1;
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();

if (!rt_mutex_has_waiters(mutex)) {
/* We are sole owner, we are done */
rwm->owner = current;
- rwm->prio = MAX_PRIO;
mutex->owner = NULL;
spin_unlock_irqrestore(&mutex->wait_lock, flags);
return;
@@ -2102,17 +1818,8 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&current->pi_lock);
plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
-
- spin_lock(&reader->pi_lock);
waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);

/* downgrade is only for mutexes */
wake_up_process(reader);
@@ -2123,124 +1830,81 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- /* setup this mutex prio for read */
- rwm->prio = next->task->prio;
-
- spin_lock(&current->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
- /* No need to add back since readers don't have PI waiters */
- } else
- rwm->prio = MAX_PRIO;
-
rt_mutex_set_owner(mutex, RT_RW_READER, 0);

spin_unlock_irqrestore(&mutex->wait_lock, flags);
-
- /*
- * Undo pi boosting when necessary.
- * If one of the awoken readers boosted us, we don't want to keep
- * that priority.
- */
- rt_mutex_adjust_prio(current);
-}
-
-void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
-{
- struct rt_mutex *mutex = &rwm->mutex;
-
- rwm->owner = NULL;
- atomic_set(&rwm->count, 0);
- atomic_set(&rwm->owners, 0);
- rwm->prio = MAX_PRIO;
- INIT_LIST_HEAD(&rwm->readers);
-
- __rt_mutex_init(mutex, name);
}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+/*
+ * These callbacks are invoked whenever a rwlock has changed priority.
+ * Since rwlocks maintain their own lists of reader dependencies, we
+ * may need to reboost any readers manually
+ */
+static inline int rt_rwlock_pi_boost(struct pi_sink *sink,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct reader_lock_struct *rls;
struct rw_mutex *rwm;
- int lock_prio;
- int i;

- for (i = 0; i < task->reader_lock_count; i++) {
- rls = &task->owned_read_locks[i];
- rwm = rls->lock;
- if (rwm) {
- lock_prio = rwm->prio;
- if (prio > lock_prio)
- prio = lock_prio;
- }
- }
+ rwm = container_of(sink, struct rw_mutex, pi_sink);

- return prio;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ rwm->prio = *src->prio;
+
+ return 0;
}

-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
+static inline int rt_rwlock_pi_update(struct pi_sink *sink,
+ unsigned int flags)
{
+ struct rw_mutex *rwm;
+ struct rt_mutex *mutex;
struct reader_lock_struct *rls;
- struct rt_mutex_waiter *waiter;
- struct task_struct *task;
- struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex);
+ unsigned long iflags;

- if (rt_mutex_has_waiters(lock)) {
- waiter = rt_mutex_top_waiter(lock);
- /*
- * Do we need to grab the task->pi_lock?
- * Really, we are only reading it. If it
- * changes, then that should follow this chain
- * too.
- */
- rwm->prio = waiter->task->prio;
- } else
- rwm->prio = MAX_PRIO;
+ rwm = container_of(sink, struct rw_mutex, pi_sink);
+ mutex = &rwm->mutex;

- if (recursion_depth >= MAX_RWLOCK_DEPTH) {
- WARN_ON(1);
- return 1;
- }
+ spin_lock_irqsave(&mutex->wait_lock, iflags);

- list_for_each_entry(rls, &rwm->readers, list) {
- task = rls->task;
- get_task_struct(task);
- /*
- * rt_mutex_adjust_prio_chain will do
- * the put_task_struct
- */
- rt_mutex_adjust_prio_chain(task, 0, orig_lock,
- orig_waiter, top_task,
- recursion_depth+1);
- }
+ list_for_each_entry(rls, &rwm->readers, list)
+ task_pi_boost(rls->task, &rls->pi_src, 0);
+
+ spin_unlock_irqrestore(&mutex->wait_lock, iflags);

return 0;
}
-#else
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
-{
- return 0;
-}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+static struct pi_sink_ops rt_rwlock_pi_sink = {
+ .boost = rt_rwlock_pi_boost,
+ .update = rt_rwlock_pi_update,
+};
+
+void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
{
- return prio;
+ struct rt_mutex *mutex = &rwm->mutex;
+
+ rwm->owner = NULL;
+ atomic_set(&rwm->count, 0);
+ atomic_set(&rwm->owners, 0);
+ rwm->prio = MAX_PRIO;
+ INIT_LIST_HEAD(&rwm->readers);
+
+ __rt_mutex_init(mutex, name);
+
+ /*
+ * Link the rwlock object to the mutex so we get notified
+ * of any priority changes in the future
+ */
+ pi_sink_init(&rwm->pi_sink, &rt_rwlock_pi_sink);
+ pi_add_sink(&mutex->pi.node, &rwm->pi_sink,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
}
+
#endif /* CONFIG_PREEMPT_RT */

static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags)
@@ -2335,8 +1999,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock, flags);
+ ret = add_waiter(lock, &waiter, &flags);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -2374,7 +2037,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -2388,13 +2051,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(timeout))
hrtimer_cancel(&timeout->timer);

- /*
- * Readjust priority, when we did not get the lock. We might
- * have been the pending owner and boosted. Since we did not
- * take the lock, the PI boost has to go.
- */
- if (unlikely(ret))
- rt_mutex_adjust_prio(current);
+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);

/* Must we reaquire the BKL? */
if (unlikely(saved_lock_depth >= 0))
@@ -2457,8 +2115,8 @@ rt_mutex_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

/*
@@ -2654,6 +2312,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
spin_lock_init(&lock->wait_lock);
plist_head_init(&lock->wait_list, &lock->wait_lock);

+ init_pi(lock);
+
debug_rt_mutex_init(lock, name);
}
EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 70df5f5..b0c6c16 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -14,6 +14,7 @@

#include <linux/rtmutex.h>
#include <linux/rt_lock.h>
+#include <linux/pi.h>

/*
* The rtmutex in kernel tester is independent of rtmutex debugging. We
@@ -48,10 +49,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
*/
struct rt_mutex_waiter {
struct plist_node list_entry;
- struct plist_node pi_list_entry;
struct task_struct *task;
struct rt_mutex *lock;
int write_lock;
+ struct {
+ struct pi_sink sink;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
struct pid *deadlock_task_pid;
@@ -79,18 +83,6 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
return w;
}

-static inline int task_has_pi_waiters(struct task_struct *p)
-{
- return !plist_head_empty(&p->pi_waiters);
-}
-
-static inline struct rt_mutex_waiter *
-task_top_pi_waiter(struct task_struct *p)
-{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
-}
-
/*
* lock->owner state tracking:
*/
diff --git a/kernel/rwlock_torture.c b/kernel/rwlock_torture.c
index 2820815..689a0d0 100644
--- a/kernel/rwlock_torture.c
+++ b/kernel/rwlock_torture.c
@@ -682,37 +682,7 @@ static int __init mutex_stress_init(void)

print_owned_read_locks(tsks[i]);

- if (tsks[i]->pi_blocked_on) {
- w = (void *)tsks[i]->pi_blocked_on;
- mtx = w->lock;
- spin_unlock_irq(&tsks[i]->pi_lock);
- spin_lock_irq(&mtx->wait_lock);
- spin_lock(&tsks[i]->pi_lock);
- own = (unsigned long)mtx->owner & ~3UL;
- oops_in_progress++;
- printk("%s:%d is blocked on ",
- tsks[i]->comm, tsks[i]->pid);
- __print_symbol("%s", (unsigned long)mtx);
- if (own == 0x100)
- printk(" owner is READER\n");
- else if (!(own & ~300))
- printk(" owner is ILLEGAL!!\n");
- else if (!own)
- printk(" has no owner!\n");
- else {
- struct task_struct *owner = (void*)own;
-
- printk(" owner is %s:%d\n",
- owner->comm, owner->pid);
- }
- oops_in_progress--;
-
- spin_unlock(&tsks[i]->pi_lock);
- spin_unlock_irq(&mtx->wait_lock);
- } else {
- print_owned_read_locks(tsks[i]);
- spin_unlock_irq(&tsks[i]->pi_lock);
- }
+ spin_unlock_irq(&tsks[i]->pi_lock);
}
}
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index eb14b9f..d1db367 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2411,12 +2411,6 @@ task_pi_init(struct task_struct *p)
pi_source_init(&p->pi.src, &p->normal_prio);
task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);

-#ifdef CONFIG_RT_MUTEXES
- p->rtmutex_prio = MAX_PRIO;
- pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
- task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
-#endif
-
/*
* We add our own task as a dependency of ourselves so that
* we get boost-notifications (via task_pi_boost_cb) whenever
@@ -5027,7 +5021,6 @@ task_pi_update_cb(struct pi_sink *sink, unsigned int flags)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
goto out_unlock;
}

@@ -5358,7 +5351,6 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);

task_pi_update(p, 0);
- rt_mutex_adjust_pi(p);

return 0;
}
@@ -8492,10 +8484,6 @@ void __init sched_init(void)

task_pi_init(&init_task);

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2008-08-15 20:33:20

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 8/8] rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance. An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart. However, the contended case
has considerably more overhead, since the lock must maintain
proper priority-queue order and support pi-boosting of the lock
owner while remaining fully preemptible.
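
For context, the fastpath/slowpath split referred to above looks
roughly like this (a simplified sketch, not patch code;
"rt_lock_sketch" is an illustrative name):

	static inline void rt_lock_sketch(struct rt_mutex *lock)
	{
		/* fastpath: uncontended, one atomic op, no PI work */
		if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
			return;

		/*
		 * slowpath: queue as a waiter, adaptive-spin or sleep,
		 * and (today) unconditionally pi-boost the owner
		 */
		rt_spin_lock_slowlock(lock);
	}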

Instrumentation shows that the majority of acquisitions under most
workloads fall either into the fastpath category or the
adaptive-spin category within the slowpath. The necessity to
pi-boost a lock owner should be sufficiently rare, yet the
slowpath blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
each spin of the lock (if we are not boosted already). If the
owner's effective priority is logically less than the waiter's
priority, we must boost the owner.

2) We check our own priority against our current queue
position on the waiters-list (if we are not boosted already).
If our priority was changed, we need to re-queue ourselves to
update our position.

3) We break out of the adaptive-spin if either of the above
conditions (1), (2) change so that we can re-evaluate the
lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
voluntarily preempt since we are losing our ability to
dynamically process the conditions above (a condensed sketch of
these checks follows below).
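
Condensed from the rt_spin_lock_slowlock()/adaptive_wait() changes
below, the in-loop checks for rules (1)-(3) look roughly like this
(sketch only; locking and wakeup details elided):

	/* inside the spin loop, while we have not yet boosted the owner */
	if (!waiter.pi.boosted) {
		/* rule 1: owner is lower priority than us -> boost now */
		if (current->prio < orig_owner->prio) {
			boost_lock(lock, &waiter);
			task_pi_update(current, 0);
		}

		/* rule 2: our own priority changed -> requeue ourselves */
		if (waiter.pi.prio != current->prio) {
			waiter.pi.prio = current->prio;
			requeue_waiter(lock, &waiter);
		}
	}

	/*
	 * Rule 3: adaptive_wait() returns 0 (stop spinning) whenever
	 * either condition above trips, so the loop re-evaluates them.
	 * Rule 4: if we give up spinning and go to sleep, we boost first.
	 */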

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a pro-active protocol
(boost on entry) versus a reactive protocol (boost when
necessary). The upside to the latter is that we don't take a
penalty for pi when it is not necessary (which is most of the time).
The downside is that we technically leave the owner exposed to
getting preempted (should it get asynchronously deprioritized), even
if our waiter is the highest priority task in the system. When this
happens, the owner would be immediately boosted (because we would
hit the "oncpu" condition, and subsequently follow the voluntary
preempt path which boosts the owner). Therefore, inversion is
correctly prevented, but we have the extra latency of the
preempt/boost/wakeup that could have been avoided in the proactive
model.

However, the design of the algorithm described above constrains the
probability of this phenomenon occurring to setscheduler()
operations. Since rt-locks do not support being interrupted by
signals or timeouts, waiters only depart via the acquisition path.
And while acquisitions do deboost the owner, the owner also
changes simultaneously, rendering the deboost moot relative to the
other waiters.

What this all means is that the downside to this implementation is
that a high-priority waiter *may* see an extra latency (equivalent
to roughly two wake-ups) if the owner has its priority reduced via
setscheduler() while it holds the lock. The penalty is
deterministic, arguably small enough, and sufficiently rare that I
do not believe it should be an issue.

Note: If the concept of other exit paths is ever introduced in the
future, simply adapting the condition to look at owner->normal_prio
instead of owner->prio should once again constrain the limitation
to setscheduler().

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner's priority is >= current's.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>
---

include/linux/rtmutex.h | 1
kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
kernel/rtmutex_common.h | 1
3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index e069182..656610b 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
struct pi_node node;
struct pi_sink sink;
int prio;
+ int boosters;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 8acbf23..ef2b508 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
struct task_struct *prev_owner = rt_mutex_owner(lock);

rtmutex_pi_owner(lock, prev_owner, 0);
rtmutex_pi_owner(lock, owner, 1);
+ }

+ if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;
- }

lock->owner = (struct task_struct *)val;
}
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *sink,

spin_lock_irqsave(&lock->wait_lock, iflags);

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
owner = rt_mutex_owner(lock);

if (owner && owner != RT_RW_READER) {
@@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
pi_node_init(&lock->pi.node);

lock->pi.prio = MAX_PRIO;
+ lock->pi.boosters = 0;
pi_source_init(&lock->pi.src, &lock->pi.prio);
pi_sink_init(&lock->pi.sink, &rtmutex_pi_sink);

@@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
}

+static inline void requeue_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ BUG_ON(!waiter->task);
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
/*
* These callbacks are invoked whenever a waiter has changed priority.
* So we should requeue it within the lock->wait_list
@@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *sink,
* pi list. Therefore, if waiter->pi.prio has changed since we
* queued ourselves, requeue it.
*/
- if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
- plist_del(&waiter->list_entry, &lock->wait_list);
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
- }
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
+ requeue_waiter(lock, waiter);

spin_unlock_irqrestore(&lock->wait_lock, iflags);

@@ -359,20 +368,9 @@ static struct pi_sink_ops rtmutex_waiter_pi_sink = {
.update = rtmutex_waiter_pi_update,
};

-/*
- * This must be called with lock->wait_lock held.
- */
-static int add_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long *flags)
+static void boost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
{
- int has_waiters = rt_mutex_has_waiters(lock);
-
- waiter->task = current;
- waiter->lock = lock;
- waiter->pi.prio = current->prio;
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
pi_sink_init(&waiter->pi.sink, &rtmutex_waiter_pi_sink);

/*
@@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
* If we previously had no waiters, we are transitioning to
* a mode where we need to boost the owner
*/
- if (!has_waiters) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 1);
}

- spin_unlock_irqrestore(&lock->wait_lock, *flags);
- task_pi_update(current, 0);
- spin_lock_irqsave(&lock->wait_lock, *flags);
-
- return 0;
+ lock->pi.boosters++;
+ waiter->pi.boosted = 1;
}

-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter)
+static void deboost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *p)
{
- struct task_struct *p = waiter->task;
+ BUG_ON(!waiter->pi.boosted);

- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
+ waiter->pi.boosted = 0;
+ lock->pi.boosters--;

/*
* We can stop boosting the owner if there are no more waiters
*/
- if (!rt_mutex_has_waiters(lock)) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 0);
}
@@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
}

/*
+ * This must be called with lock->wait_lock held.
+ */
+static void _add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ _add_waiter(lock, waiter);
+
+ boost_lock(lock, waiter);
+
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ if (waiter->pi.boosted)
+ deboost_lock(lock, waiter, p);
+}
+
+/*
* Wake up the next waiter on the lock.
*
* Remove the top waiter from the current tasks waiter list and from
@@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
if (orig_owner != rt_mutex_owner(waiter->lock))
return 0;

+ /* Special handling for when we are not in pi-boost mode */
+ if (!waiter->pi.boosted) {
+ /*
+ * Are we higher priority than the owner? If so
+ * we should bail out immediately so that we can
+ * pi boost them.
+ */
+ if (current->prio < orig_owner->prio)
+ return 0;
+
+ /*
+ * Did our priority change? If so, we need to
+ * requeue our position in the list
+ */
+ if (waiter->pi.prio != current->prio)
+ return 0;
+ }
+
/* Owner went to bed, so should we */
if (!task_is_current(orig_owner))
return 1;
@@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unsigned long saved_state, state, flags;
struct task_struct *orig_owner;
int missed = 0;
+ int boosted = 0;

init_waiter(&waiter);

@@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
}
missed = 1;

+ orig_owner = rt_mutex_owner(lock);
+
/*
* waiter.task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by an higher prio task.
*/
- if (!waiter.task) {
- add_waiter(lock, &waiter, &flags);
+ if (!waiter.task)
+ _add_waiter(lock, &waiter);
+
+ /*
+ * We only need to pi-boost the owner if they are lower
+ * priority than us. We don't care if this is racy
+ * against priority changes as we will break out of
+ * the adaptive spin anytime any priority changes occur
+ * without boosting enabled.
+ */
+ if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
}

/*
+ * If we are not currently pi-boosting the lock, we have to
+ * monitor whether our priority changed since the last
+ * time it was recorded and requeue ourselves if it moves.
+ */
+ if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
+ waiter.pi.prio = current->prio;
+
+ requeue_waiter(lock, &waiter);
+ }
+
+ /*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
*/
saved_flags = current->flags & PF_NOSCHED;
current->lock_depth = -1;
current->flags &= ~PF_NOSCHED;
- orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
spin_unlock_irqrestore(&lock->wait_lock, flags);

@@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* barrier which we rely upon to ensure current->state
* is visible before we test waiter.task.
*/
+ if (waiter.task && !waiter.pi.boosted) {
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
+ /*
+ * We get here if we have not yet boosted
+ * the lock, yet we are going to sleep. If
+ * we are still pending (waiter.task != 0),
+ * then go ahead and boost them now
+ */
+ if (waiter.task) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+ }
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ }
+
if (waiter.task)
schedule_rt_mutex(lock);
} else
@@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_unlock_irqrestore(&lock->wait_lock, flags);

/* Undo any pi boosting, if necessary */
- task_pi_update(current, 0);
+ if (boosted)
+ task_pi_update(current, 0);

debug_rt_mutex_free_waiter(&waiter);
}
@@ -708,6 +810,7 @@ static void noinline __sched
rt_spin_lock_slowunlock(struct rt_mutex *lock)
{
unsigned long flags;
+ int deboost = 0;

spin_lock_irqsave(&lock->wait_lock, flags);

@@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
return;
}

+ if (lock->pi.boosters)
+ deboost = 1;
+
wakeup_next_waiter(lock, 1);

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting when necessary */
- task_pi_update(current, 0);
+ if (deboost)
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index b0c6c16..8d4f745 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -55,6 +55,7 @@ struct rt_mutex_waiter {
struct {
struct pi_sink sink;
int prio;
+ int boosted;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
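
A condensed sketch of the resulting rt_spin_lock_slowlock() flow may help
when reading the hunks above. This is illustrative only, not the literal
patch code: the wait_lock handling, the saved task state, the BKL
juggling, the wakeup races and the exit path are all omitted, and only
the boost-related decisions are shown.

static void rt_spin_lock_slowlock_outline(struct rt_mutex *lock)
{
        struct rt_mutex_waiter waiter;
        struct task_struct *orig_owner;

        init_waiter(&waiter);

        for (;;) {
                /* (take wait_lock, try to take the lock, break on success) */

                if (!waiter.task)
                        _add_waiter(lock, &waiter); /* queue, but do not boost yet */

                orig_owner = rt_mutex_owner(lock);

                /* Boost only if the owner is actually in our way */
                if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
                        boost_lock(lock, &waiter);
                        task_pi_update(current, 0); /* propagate to the owner */
                }

                /* While unboosted, track our own priority changes by hand */
                if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
                        waiter.pi.prio = current->prio;
                        requeue_waiter(lock, &waiter);
                }

                if (adaptive_wait(&waiter, orig_owner)) {
                        /* About to sleep: boost now if we never did */
                        if (waiter.task && !waiter.pi.boosted) {
                                boost_lock(lock, &waiter);
                                task_pi_update(current, 0);
                        }
                        if (waiter.task)
                                schedule_rt_mutex(lock);
                }
        }
}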

2008-08-15 20:35:23

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Gregory Haskins wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence. However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure. This causes a few
> problems:
>
> 1) This tightly coupled relationship makes it difficult to extend to
> other areas of the kernel (for instance, pi-aware wait-queues may
> be desirable).
> 2) Enhancing the rtmutex infrastructure becomes challenging because
> there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version. The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.
>
> We introduce the concept of a "pi_source" and a "pi_sink", where, as the
> name suggests provides the basic relationship of a priority source, and
> its boosted target. A pi_source acts as a reference to some arbitrary
> source of priority, and a pi_sink can be boosted (or deboosted) by
> a pi_source. For more details, please read the library documentation.
>
> There are currently no users of this inteface.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> Documentation/libpi.txt | 59 ++++++
> include/linux/pi.h | 293 ++++++++++++++++++++++++++++
> lib/Makefile | 3
> lib/pi.c | 489 +++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 843 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/libpi.txt
> create mode 100644 include/linux/pi.h
> create mode 100644 lib/pi.c
>
> diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
> new file mode 100644
> index 0000000..197b21a
> --- /dev/null
> +++ b/Documentation/libpi.txt
> @@ -0,0 +1,59 @@
> +lib/pi.c - Priority Inheritance library
> +
> +Sources and sinks:
> +------------------
> +
> +This library introduces the basic concept of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
> +
> +A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boost and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
> +
> +It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for updating a chain of sources and sinks in the proper order. If we passed the priority by value on the stack, the order in which the system executes could allow the actual value that ends up being set to race.
> +
> +Nodes:
> +
> +A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
> +
> +Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
> +
> + 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
> + 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
> + 3) taking care to avoid deadlock during a chain update.
> +
> +Design details:
> +
> +Destruction:
> +
> +The pi-library objects are designed to be implicitly destructible (meaning they do not require an explicit “free()” operation when they are no longer used). This is important considering their intended use (spinlock_t's, which are also implicitly destructible). As such, any allocations needed for operation must come from internal structure storage, as there will be no opportunity to free them later.
> +
> +Multiple sinks per Node:
> +
> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called “leaf sinks”.
> +
> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
> +
> +The following diagram depicts an example relationship (warning: cheesy ascii art)
> +
> + --------- ---------
> + | leaf | | leaf |
> + --------- ---------
> + / /
> + --------- / ---------- / --------- ---------
> + ->-| node |->---| node |-->---| node |->---| leaf |
> + --------- ---------- --------- ---------
> +
> +The reason why this was done was to unify the notion of a “sink” under a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. I'm a task, I'm a lock, I'm a node, I'm a work-item, I'm a wait-queue, etc.).
> +
> +Sinkrefs:
> +
> +Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-sources relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
> +
> +Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next. But, long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
> +
> +Locking:
> +
> +(work in progress....)
> +
> +
> +
> +
> +
> diff --git a/include/linux/pi.h b/include/linux/pi.h
> new file mode 100644
> index 0000000..5535474
> --- /dev/null
> +++ b/include/linux/pi.h
> @@ -0,0 +1,293 @@
> +/*
> + * see Documentation/libpi.txt for details
> + */
> +
> +#ifndef _LINUX_PI_H
> +#define _LINUX_PI_H
> +
> +#include <linux/list.h>
> +#include <linux/plist.h>
> +#include <asm/atomic.h>
> +
> +#define MAX_PI_DEPENDENCIES 5
> +
> +struct pi_source {
> + struct plist_node list;
> + int *prio;
> + int boosted;
> +};
> +
> +
> +#define PI_FLAG_DEFER_UPDATE (1 << 0)
> +#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
> +
> +struct pi_sink;
> +
> +struct pi_sink_ops {
> + int (*boost)(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags);
> + int (*deboost)(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags);
> + int (*update)(struct pi_sink *sink,
> + unsigned int flags);
> + int (*free)(struct pi_sink *sink,
> + unsigned int flags);
> +};
> +
> +struct pi_sink {
> + atomic_t refs;
> + struct pi_sink_ops *ops;
> +};
> +
> +enum pi_state {
> + pi_state_boost,
> + pi_state_boosted,
> + pi_state_deboost,
> + pi_state_free,
> +};
> +
> +/*
> + * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
> + * rtmutex infrastructure.
> + */
> +
> +struct pi_sinkref {
> + raw_spinlock_t lock;
> + struct list_head list;
> + enum pi_state state;
> + struct pi_sink *sink;
> + struct pi_source src;
> + atomic_t refs;
> +};
> +
> +struct pi_sinkref_pool {
> + struct list_head free;
> + struct pi_sinkref data[MAX_PI_DEPENDENCIES];
> +};
> +
> +struct pi_node {
> + raw_spinlock_t lock;
> + int prio;
> + struct pi_sink sink;
> + struct pi_sinkref_pool sinkref_pool;
> + struct list_head sinks;
> + struct plist_head srcs;
> +};
> +
> +/**
> + * pi_node_init - initialize a pi_node before use
> + * @node: a node context
> + */
> +extern void pi_node_init(struct pi_node *node);
> +
> +/**
> + * pi_add_sink - add a sink as a downstream object
> + * @node: the node context
> + * @sink: the sink context to add to the node
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
> + *
> + * This function registers a sink to get notified whenever the
> + * node changes priority.
> + *
> + * Note: By default, this function will schedule the newly added sink
> + * to get an initial boost notification on the next update (even
> + * without the presence of a priority transition). However, if the
> + * ALREADY_BOOSTED flag is specified, the sink is initially marked as
> + * BOOSTED and will only get notified if the node changes priority
> + * in the future.
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +extern int pi_add_sink(struct pi_node *node, struct pi_sink *sink,
> + unsigned int flags);
> +
> +/**
> + * pi_del_sink - del a sink from the current downstream objects
> + * @node: the node context
> + * @sink: the sink context to delete from the node
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + *
> + * This function unregisters a sink from the node.
> + *
> + * Note: The sink will not actually become fully deboosted until
> + * a call to node.update() successfully returns.
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +extern int pi_del_sink(struct pi_node *node, struct pi_sink *sink,
> + unsigned int flags);
> +
> +/**
> + * pi_sink_init - initialize a pi_sink before use
> + * @sink: a sink context
> + * @ops: pointer to an pi_sink_ops structure
> + */
> +static inline void
> +pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
> +{
> + atomic_set(&sink->refs, 0);
> + sink->ops = ops;
> +}
> +
> +/**
> + * pi_source_init - initialize a pi_source before use
> + * @src: a src context
> + * @prio: pointer to a priority value
> + *
> + * A pointer to a priority value is used so that boost and update
> + * are fully idempotent.
> + */
> +static inline void
> +pi_source_init(struct pi_source *src, int *prio)
> +{
> + plist_node_init(&src->list, *prio);
> + src->prio = prio;
> + src->boosted = 0;
> +}
> +
> +/**
> + * pi_boost - boost a node with a pi_source
> + * @node: the node context
> + * @src: the src context to boost the node with
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + *
> + * This function registers a priority source with the node, possibly
> + * boosting its value if the new source is the highest registered source.
> + *
> + * This function is used to both initially register a source, as well as
> + * to notify the node if the value changes in the future (even if the
> + * priority is decreasing).
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +static inline int
> +pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + if (sink->ops->boost)
> + return sink->ops->boost(sink, src, flags);
> +
> + return 0;
> +}
> +
> +/**
> + * pi_deboost - deboost a pi_source from a node
> + * @node: the node context
> + * @src: the src context to boost the node with
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + *
> + * This function unregisters a priority source from the node, possibly
> + * deboosting its value if the departing source was the highest
> + * registered source.
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +static inline int
> +pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + if (sink->ops->deboost)
> + return sink->ops->deboost(sink, src, flags);
> +
> + return 0;
> +}
> +
> +/**
> + * pi_update - force a manual chain update
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * This function will push any priority changes (as a result of
> + * boost/deboost or add_sink/del_sink) down through the chain.
> + * If no changes are necessary, this function is a no-op.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +static inline int
> +pi_update(struct pi_node *node, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + if (sink->ops->update)
> + return sink->ops->update(sink, flags);
> +
> + return 0;
> +}
> +
> +/**
> + * pi_sink_put - down the reference count, freeing the sink if 0
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_sink_put(struct pi_sink *sink, unsigned int flags)
> +{
> + if (atomic_dec_and_test(&sink->refs)) {
> + if (sink->ops->free)
> + sink->ops->free(sink, flags);
> + }
> +}
> +
> +
> +/**
> + * pi_get - up the reference count
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_get(struct pi_node *node, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + atomic_inc(&sink->refs);
> +}
> +
> +/**
> + * pi_put - down the reference count, freeing the node if 0
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_put(struct pi_node *node, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + pi_sink_put(sink, flags);
> +}
> +
> +#endif /* _LINUX_PI_H */
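
As a reading aid (not part of the patch), here is a minimal sketch of how
the interfaces above compose: two nodes chained together, with a single
external source boosting the head of the chain. The node_a/node_b and
waiter_* names are invented for this example.

#include <linux/pi.h>

static struct pi_node node_a;           /* e.g. embedded in a lock */
static struct pi_node node_b;           /* e.g. representing the lock owner */

static int waiter_prio;                 /* the priority value being donated */
static struct pi_source waiter_src;

static void pi_chain_sketch(void)
{
        pi_node_init(&node_a);
        pi_node_init(&node_b);

        /* node_b becomes a downstream sink of node_a:  a ->- b */
        pi_add_sink(&node_a, &node_b.sink, 0);

        /* a source is a *reference* to a priority, not a snapshot of it */
        waiter_prio = 10;
        pi_source_init(&waiter_src, &waiter_prio);

        /* boost a; the synchronous update propagates the new prio to b */
        pi_boost(&node_a, &waiter_src, 0);

        /* refreshing the value goes through the very same call ... */
        waiter_prio = 5;
        pi_boost(&node_a, &waiter_src, 0);

        /* ... and dropping the contribution deboosts the chain again */
        pi_deboost(&node_a, &waiter_src, 0);

        pi_del_sink(&node_a, &node_b.sink, 0);
}

Because a node is simultaneously a sink (it can be boosted) and a source
(it re-boosts its own sinks), arbitrarily long chains are built simply by
adding the next node's sink to the previous node.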
> diff --git a/lib/Makefile b/lib/Makefile
> index 5187924..df81ad7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
> lib-y += kobject.o kref.o klist.o
>
> obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
> - bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
> + bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
> + pi.o
>
> ifeq ($(CONFIG_DEBUG_KOBJECT),y)
> CFLAGS_kobject.o += -DDEBUG
> diff --git a/lib/pi.c b/lib/pi.c
> new file mode 100644
> index 0000000..d00042c
> --- /dev/null
> +++ b/lib/pi.c
> @@ -0,0 +1,489 @@
> +/*
> + * lib/pi.c
> + *
> + * Priority-Inheritance library
> + *
> + * Copyright (C) 2008 Novell
> + *
> + * Author: Gregory Haskins <[email protected]>
> + *
> + * This code provides a generic framework for preventing priority
> + * inversion by means of priority-inheritance. (see Documentation/libpi.txt
> + * for details)
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/module.h>
> +#include <linux/pi.h>
> +
> +
> +struct updater {
> + int update;
> + struct pi_sinkref *sinkref;
> + struct pi_sink *sink;
> +};
> +
> +/*
> + *-----------------------------------------------------------
> + * pi_sinkref_pool
> + *-----------------------------------------------------------
> + */
> +
> +static void
> +pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
> +{
> + int i;
> +
> + INIT_LIST_HEAD(&pool->free);
> +
> + for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
> + struct pi_sinkref *sinkref = &pool->data[i];
> +
> + memset(sinkref, 0, sizeof(*sinkref));
> + INIT_LIST_HEAD(&sinkref->list);
> + list_add_tail(&sinkref->list, &pool->free);
> + }
> +}
> +
> +static struct pi_sinkref *
> +pi_sinkref_alloc(struct pi_sinkref_pool *pool)
> +{
> + struct pi_sinkref *sinkref;
> +
> + if (list_empty(&pool->free))
> + return NULL;
> +
> + sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
> + list_del(&sinkref->list);
> + memset(sinkref, 0, sizeof(*sinkref));
> +
> + return sinkref;
> +}
> +
> +static void
> +pi_sinkref_free(struct pi_sinkref_pool *pool,
> + struct pi_sinkref *sinkref)
> +{
> + list_add_tail(&sinkref->list, &pool->free);
> +}
> +
> +/*
> + *-----------------------------------------------------------
> + * pi_sinkref
> + *-----------------------------------------------------------
> + */
> +
> +static inline void
> +_pi_sink_get(struct pi_sinkref *sinkref)
> +{
> + atomic_inc(&sinkref->sink->refs);
> + atomic_inc(&sinkref->refs);
> +}
> +
> +static inline void
> +_pi_sink_put_local(struct pi_node *node, struct pi_sinkref *sinkref)
> +{
> + if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
> + list_del(&sinkref->list);
> + pi_sinkref_free(&node->sinkref_pool, sinkref);
> + spin_unlock(&node->lock);
> + }
> +}
> +
> +static inline void
> +_pi_sink_put_all(struct pi_node *node, struct pi_sinkref *sinkref)
> +{
> + struct pi_sink *sink = sinkref->sink;
> +
> + _pi_sink_put_local(node, sinkref);
> + pi_sink_put(sink, 0);
> +}
> +
> +/*
> + *-----------------------------------------------------------
> + * pi_node
> + *-----------------------------------------------------------
> + */
> +
> +static struct pi_node *node_of(struct pi_sink *sink)
> +{
> + return container_of(sink, struct pi_node, sink);
> +}
> +
> +static inline void
> +__pi_boost(struct pi_node *node, struct pi_source *src)
> +{
> + BUG_ON(src->boosted);
> +
> + plist_node_init(&src->list, *src->prio);
> + plist_add(&src->list, &node->srcs);
> + src->boosted = 1;
> +}
> +
> +static inline void
> +__pi_deboost(struct pi_node *node, struct pi_source *src)
> +{
> + BUG_ON(!src->boosted);
> +
> + plist_del(&src->list, &node->srcs);
> + src->boosted = 0;
> +}
> +
> +/*
> + * _pi_node_update - update the chain
> + *
> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> + * that need to propagate up the chain. This is a step-wise process where we
> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
> + * times, we guarantee that this update routine is an effective barrier...
> + * all modifications made prior to the call to this barrier will have completed.
> + *
> + * Deadlock avoidance: This node may participate in a chain of nodes which
> + * form a graph of arbitrary structure. While the graph should technically
> + * never close on itself barring any bugs, we still want to protect against
> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> + * from detecting this potential). To do this, we employ a dual-locking
> + * scheme where we can carefully control the order. That is: node->lock
> + * protects most of the node's internal state, but it will never be held
> + * across a chain update. sinkref->lock, on the other hand, can be held
> + * across a boost/deboost, and also guarantees proper execution order. Also
> + * note that no locks are held across an sink->update.
> + */
> +static int
> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + struct pi_sinkref *sinkref;
> + unsigned long iflags;
> + int count = 0;
> + int i;
> + int pprio;
> + struct updater updaters[MAX_PI_DEPENDENCIES];
> +
> + spin_lock_irqsave(&node->lock, iflags);
> +
> + pprio = node->prio;
> +
> + if (!plist_head_empty(&node->srcs))
> + node->prio = plist_first(&node->srcs)->prio;
> + else
> + node->prio = MAX_PRIO;
> +
> + list_for_each_entry(sinkref, &node->sinks, list) {
> + /*
> + * If the priority is changing, or if this is a
> + * BOOST/DEBOOST, we consider this sink "stale"
> + */
> + if (pprio != node->prio
> + || sinkref->state != pi_state_boosted) {
> + struct updater *iter = &updaters[count++];
> +
> + BUG_ON(!atomic_read(&sinkref->sink->refs));
> + _pi_sink_get(sinkref);
> +
> + iter->update = 1;
> + iter->sinkref = sinkref;
> + iter->sink = sinkref->sink;
> + }
> + }
> +
> + spin_unlock(&node->lock);
> +
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
> + struct pi_sink *sink;
> +
> + sinkref = iter->sinkref;
> + sink = iter->sink;
> +
> + spin_lock(&sinkref->lock);
> +
> + switch (sinkref->state) {
> + case pi_state_boost:
> + sinkref->state = pi_state_boosted;
> + /* Fall through */
> + case pi_state_boosted:
> + sink->ops->boost(sink, &sinkref->src, lflags);
> + break;
> + case pi_state_deboost:
> + sink->ops->deboost(sink, &sinkref->src, lflags);
> + sinkref->state = pi_state_free;
> +
> + /*
> + * drop the ref that we took when the sinkref
> + * was allocated. We still hold a ref from
> + * above.
> + */
> + _pi_sink_put_all(node, sinkref);
> + break;
> + case pi_state_free:
> + iter->update = 0;
> + break;
> + default:
> + panic("illegal sinkref type: %d", sinkref->state);
> + }
> +
> + spin_unlock(&sinkref->lock);
> +
> + /*
> + * We will drop the sinkref reference while still holding the
> + * preempt/irqs off so that the memory is returned synchronously
> + * to the system.
> + */
> + _pi_sink_put_local(node, sinkref);
> + }
> +
> + local_irq_restore(iflags);
> +
> + /*
> + * Note: At this point, sinkref is invalid since we put'd
> + * it above, but sink is valid since we still hold the remote
> + * reference. This is key to the design because it allows us
> + * to synchronously free the sinkref object, yet maintain a
> + * reference to the sink across the update
> + */
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> +
> + if (iter->update)
> + iter->sink->ops->update(iter->sink, 0);
> + }
> +
> + /*
> + * We perform all the free operations together at the end, using
> + * only automatic/stack variables since any one of these operations
> + * could result in our node object being deallocated
> + */
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> +
> + pi_sink_put(iter->sink, 0);
> + }
> +
> + return 0;
> +}
> +
> +static int
> +_pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_sinkref *sinkref;
> + struct updater updaters[MAX_PI_DEPENDENCIES];
> + unsigned long iflags;
> + int count = 0;
> + int i;
> +
> + local_irq_save(iflags);
> + spin_lock(&node->lock);
> +
> + list_for_each_entry(sinkref, &node->sinks, list) {
> + if (!sink || sink == sinkref->sink) {
> + struct updater *iter = &updaters[count++];
> +
> + _pi_sink_get(sinkref);
> + iter->sinkref = sinkref;
> + iter->sink = sinkref->sink;
> + }
> + }
> +
> + spin_unlock(&node->lock);
> +
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> + int remove = 0;
> +
> + sinkref = iter->sinkref;
> +
> + spin_lock(&sinkref->lock);
> +
> + switch (sinkref->state) {
> + case pi_state_boost:
> + /*
> + * This state indicates the sink was never formally
> + * boosted so we can just delete it immediately
> + */
> + remove = 1;
> + break;
> + case pi_state_boosted:
> + if (sinkref->sink->ops->deboost)
> + /*
> + * If the sink supports deboost notification,
> + * schedule it for deboost at the next update
> + */
> + sinkref->state = pi_state_deboost;
> + else
> + /*
> + * ..otherwise schedule it for immediate
> + * removal
> + */
> + remove = 1;
> + break;
> + default:
> + break;
> + }
> +
> + if (remove) {
> + /*
> + * drop the ref that we took when the sinkref
> + * was allocated. We still hold a ref from
> + * above
> + */
> + _pi_sink_put_all(node, sinkref);
> + sinkref->state = pi_state_free;
> + }
> +
> + spin_unlock(&sinkref->lock);
> +
> + _pi_sink_put_local(node, sinkref);
> + }
> +
> + local_irq_restore(iflags);
> +
> + for (i = 0; i < count; ++i)
> + pi_sink_put(updaters[i].sink, 0);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(&node->sink, 0);
> +
> + return 0;
> +}
> +
> +static int
> +_pi_node_boost(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + unsigned long iflags;
> +
> + spin_lock_irqsave(&node->lock, iflags);
> + if (src->boosted)
> + __pi_deboost(node, src);
> + __pi_boost(node, src);
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(sink, 0);
> +
> + return 0;
> +}
> +
> +static int
> +_pi_node_deboost(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + unsigned long iflags;
> +
> + spin_lock_irqsave(&node->lock, iflags);
> + __pi_deboost(node, src);
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(sink, 0);
> +
> + return 0;
> +}
> +
> +static int
> +_pi_node_free(struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> +
> + /*
> + * When the node is freed, we should perform an implicit
> + * del_sink on any remaining sinks we may have.
> + */
> + return _pi_del_sink(node, NULL, flags);
> +}
> +
> +static struct pi_sink_ops pi_node_sink = {
> + .boost = _pi_node_boost,
> + .deboost = _pi_node_deboost,
> + .update = _pi_node_update,
> + .free = _pi_node_free,
> +};
> +
> +void
> +pi_node_init(struct pi_node *node)
> +{
> + spin_lock_init(&node->lock);
> + node->prio = MAX_PRIO;
> + atomic_set(&node->sink.refs, 1);
> + node->sink.ops = &pi_node_sink;
>
^^^^^^

Note to self: this should use pi_sink_init()


> + pi_sinkref_pool_init(&node->sinkref_pool);
> + INIT_LIST_HEAD(&node->sinks);
> + plist_head_init(&node->srcs, &node->lock);
> +}
> +
> +int
> +pi_add_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_sinkref *sinkref;
> + int ret = 0;
> + unsigned long iflags;
> +
> + spin_lock_irqsave(&node->lock, iflags);
> +
> + if (!atomic_read(&node->sink.refs)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + sinkref = pi_sinkref_alloc(&node->sinkref_pool);
> + if (!sinkref) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + spin_lock_init(&sinkref->lock);
> + INIT_LIST_HEAD(&sinkref->list);
> +
> + if (flags & PI_FLAG_ALREADY_BOOSTED)
> + sinkref->state = pi_state_boosted;
> + else
> + /*
> + * Schedule it for addition at the next update
> + */
> + sinkref->state = pi_state_boost;
> +
> + pi_source_init(&sinkref->src, &node->prio);
> + sinkref->sink = sink;
> +
> + /* set one ref from ourselves. It will be dropped on del_sink */
> + atomic_inc(&sinkref->sink->refs);
> + atomic_set(&sinkref->refs, 1);
> +
> + list_add_tail(&sinkref->list, &node->sinks);
> +
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(&node->sink, 0);
> +
> + return 0;
> +
> + out:
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + return ret;
> +}
> +
> +int
> +pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
> +{
> + /*
> + * There may be multiple matches to sink because sometimes a
> + * deboost/free may still be pending an update when the same
> + * node has been added. So we want to process any and all
> + * instances that match our target
> + */
> + return _pi_del_sink(node, sink, flags);
> +}
> +
> +
> +
>
>

2008-08-16 16:00:54

by Matthias Behr

[permalink] [raw]
Subject: AW: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Greg,

I got a few review comments/questions. Pls see below.

Best Regards,
Matthias

P.S. I'm a kernel newbie so don't hesitate to tell me if I'm wrong ;-)

> +/**
> + * pi_sink_init - initialize a pi_sink before use
> + * @sink: a sink context
> + * @ops: pointer to an pi_sink_ops structure
> + */
> +static inline void
> +pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
> +{
> + atomic_set(&sink->refs, 0);
> + sink->ops = ops;
> +}

Shouldn't ops be tested for 0 here? (ASSERT/BUG_ON/...) It gets dereferenced later quite often in the form "if (sink->ops->...)".

> +/**
> + * pi_sink_put - down the reference count, freeing the sink if 0
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_sink_put(struct pi_sink *sink, unsigned int flags)
> +{
> + if (atomic_dec_and_test(&sink->refs)) {
> + if (sink->ops->free)
> + sink->ops->free(sink, flags);
> + }
> +}

Shouldn't the atomic/locked part cover the ...->free(...) as well? A pi_get right after the atomic_dec_and_test but before the free() could lead to a free() with refs>0?


2008-08-16 19:56:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

On Fri, 2008-08-15 at 16:28 -0400, Gregory Haskins wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence. However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure. This causes a few
> problems:
>
> 1) This tightly coupled relationship makes it difficult to extend to
> other areas of the kernel (for instance, pi-aware wait-queues may
> be desirable).
> 2) Enhancing the rtmutex infrastructure becomes challenging because
> there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version. The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.
>
> We introduce the concept of a "pi_source" and a "pi_sink", where, as the
> name suggests provides the basic relationship of a priority source, and
> its boosted target. A pi_source acts as a reference to some arbitrary
> source of priority, and a pi_sink can be boosted (or deboosted) by
> a pi_source. For more details, please read the library documentation.
>
> There are currently no users of this inteface.

You should have started out by discussing your design - the document
just rambles a bit about some implementation details - it doesn't talk
about how it maps to the PI problem space.

Anyway - from what I can make of the code, you managed to convert the pi
graph walking code that used to be in rt_mutex_adjust_prio_chain() and
was iterative, into a recursive function call.

Not something you should do lightly..

2008-08-19 08:37:20

by Gregory Haskins

[permalink] [raw]
Subject: Re: AW: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Matthias,

Matthias Behr wrote:
> Hi Greg,
>
> I got a few review comments/questions. Pls see below.
>
> Best Regards,
> Matthias
>
> P.S. I'm a kernel newbie so don't hesitate to tell me if I'm wrong ;-)
>
>
>> +/**
>> + * pi_sink_init - initialize a pi_sink before use
>> + * @sink: a sink context
>> + * @ops: pointer to an pi_sink_ops structure
>> + */
>> +static inline void
>> +pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
>> +{
>> + atomic_set(&sink->refs, 0);
>> + sink->ops = ops;
>> +}
>>
>
> Shouldn't ops be tested for 0 here? (ASSERT/BUG_ON/...) (get's dereferenced later quite often in the form "if (sink->ops->...)".
>

This is a good idea. I will add this.

>
>> +/**
>> + * pi_sink_put - down the reference count, freeing the sink if 0
>> + * @node: the node context
>> + * @flags: optional flags to modify behavior. Reserved, must be 0.
>> + *
>> + * Returns: none
>> + */
>> +static inline void
>> +pi_sink_put(struct pi_sink *sink, unsigned int flags)
>> +{
>> + if (atomic_dec_and_test(&sink->refs)) {
>> + if (sink->ops->free)
>> + sink->ops->free(sink, flags);
>> + }
>> +}
>>
>
> Shouldn't the atomic/locked part cover the ...->free(...) as well?

Actually, it already does. The free can only be called by the last
reference dropping the ref-count.


> A pi_get right after the atomic_dec_and_test but before the free() could lead to a free() with refs>0?
>

A pi_get() issued after the ref count could already have dropped to zero
indicates breakage at a higher layer. E.g. the caller of pi_get() has to
ensure that there are no races against the reference dropping to begin
with. This is the same as any reference-counted object (for instance,
see get_task_struct()).
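
To put that in code form (a sketch against the pi.h interfaces; "my_node"
and the function are invented for illustration):

static struct pi_node my_node;

static void refcount_sketch(void)
{
        pi_node_init(&my_node);         /* sink.refs = 1, owned by the creator */

        pi_get(&my_node, 0);            /* legal: taken while a ref is known
                                         * to be held */
        /* ... use the node ... */
        pi_put(&my_node, 0);            /* drop our extra reference */

        pi_put(&my_node, 0);            /* drop the creator's reference;
                                         * node->sink.ops->free() may run here */

        /*
         * A pi_get() at this point would be a bug in the *caller*: the
         * object may already be gone, and no locking inside pi_sink_put()
         * could repair that; this is just like get_task_struct().
         */
}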

Thanks for the review, Matthias!

Regards,
-Greg



Attachments:
signature.asc (257.00 B)
OpenPGP digital signature

2008-08-19 08:42:44

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Peter,

Peter Zijlstra wrote:
> On Fri, 2008-08-15 at 16:28 -0400, Gregory Haskins wrote:
>
>> The kernel currently addresses priority-inversion through priority-
>> inheritence. However, all of the priority-inheritence logic is
>> integrated into the Real-Time Mutex infrastructure. This causes a few
>> problems:
>>
>> 1) This tightly coupled relationship makes it difficult to extend to
>> other areas of the kernel (for instance, pi-aware wait-queues may
>> be desirable).
>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>> there is no seperation between the locking code, and the pi-code.
>>
>> This patch aims to rectify these shortcomings by designing a stand-alone
>> pi framework which can then be used to replace the rtmutex-specific
>> version. The goal of this framework is to provide similar functionality
>> to the existing subsystem, but with sole focus on PI and the
>> relationships between objects that can boost priority, and the objects
>> that get boosted.
>>
>> We introduce the concept of a "pi_source" and a "pi_sink", where, as the
>> name suggests provides the basic relationship of a priority source, and
>> its boosted target. A pi_source acts as a reference to some arbitrary
>> source of priority, and a pi_sink can be boosted (or deboosted) by
>> a pi_source. For more details, please read the library documentation.
>>
>> There are currently no users of this inteface.
>>
>
> You should have started out by discussing your design - the document
> just rambles a bit about some implementation details - it doesn't talk
> about how it maps to the PI problem space.
>

The doc is still a work-in-progress, but point taken ;) I will address
this shortly.


> Anyway - from what I can make of the code, you managed to convert the pi
> graph walking code that used to be in rt_mutex_adjust_prio_chain() and
> was iterative, into a recursive function call.
>
> Not something you should do lightly..
>

As we discussed on IRC yesterday, you are correct here. I was thinking
that the graph couldn't get deeper than a few dozen entries, but I
forgot about userspace futex access. But, this is precisely what the
"release early" policy is designed to catch ;)

I think I can make a slight adjustment to the model to return it to an
iterative design. I will address this in v5.

Thanks for the review, Peter!

Regards,
-Greg





Attachments:
signature.asc (257.00 B)
OpenPGP digital signature

2008-08-22 12:55:52

by Esben Nielsen

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Disclaimer: I am no longer actively involved and I must admit I might
have lost out on much of what has been going on since I contributed to
the PI system 2 years ago. But I allow myself to comment anyway.

On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence. However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure. This causes a few
> problems:
>
> 1) This tightly coupled relationship makes it difficult to extend to
> other areas of the kernel (for instance, pi-aware wait-queues may
> be desirable).
> 2) Enhancing the rtmutex infrastructure becomes challenging because
> there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version. The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.

This is really a good idea. When I had time (2 years ago) to actively
work on these problems, I also came to the conclusion that PI should be
more general than just the rtmutex. Preemptive RCU was the example which
drove it.

But I do disagree that general objects should get boosted: the end
targets are always tasks. The objects might be boosted as intermediate
steps, but in the end priority only applies to tasks.

I also have a few comments to the actual design:

> ....
> +
> +Multiple sinks per Node:
> +
> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
> +
> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.

This is bad from an RT point of view: you have a hard time determining
the number of sinks per node. An rw-lock could have an arbitrary number
of readers (it is supposed to, really). Therefore you have no chance of
knowing how long the boost/deboost operation will take. And you also
don't know for how long the boosted tasks stay boosted. If there can be
an arbitrary number of such tasks, you can no longer be deterministic.

> ...
> +
> +#define MAX_PI_DEPENDENCIES 5


WHAT??? There is a finite lock depth defined. I know we did that
originally but it wasn't hardcoded (as far as I remember) and
it was certainly not as low as 5.

Remember: PI is used by the user space futexes as well!

> ....
> +/*
> + * _pi_node_update - update the chain
> + *
> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> + * that need to propagate up the chain. This is a step-wise process where we
> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
> + * times, we guarantee that this update routine is an effective barrier...
> + * all modifications made prior to the call to this barrier will have completed.
> + *
> + * Deadlock avoidance: This node may participate in a chain of nodes which
> + * form a graph of arbitrary structure. While the graph should technically
> + * never close on itself barring any bugs, we still want to protect against
> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> + * from detecting this potential). To do this, we employ a dual-locking
> + * scheme where we can carefully control the order. That is: node->lock
> + * protects most of the node's internal state, but it will never be held
> + * across a chain update. sinkref->lock, on the other hand, can be held
> + * across a boost/deboost, and also guarantees proper execution order. Also
> + * note that no locks are held across an sink->update.
> + */
> +static int
> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + struct pi_sinkref *sinkref;
> + unsigned long iflags;
> + int count = 0;
> + int i;
> + int pprio;
> + struct updater updaters[MAX_PI_DEPENDENCIES];
> +
> + spin_lock_irqsave(&node->lock, iflags);
> +
> + pprio = node->prio;
> +
> + if (!plist_head_empty(&node->srcs))
> + node->prio = plist_first(&node->srcs)->prio;
> + else
> + node->prio = MAX_PRIO;
> +
> + list_for_each_entry(sinkref, &node->sinks, list) {
> + /*
> + * If the priority is changing, or if this is a
> + * BOOST/DEBOOST, we consider this sink "stale"
> + */
> + if (pprio != node->prio
> + || sinkref->state != pi_state_boosted) {
> + struct updater *iter = &updaters[count++];

What prevents count from overrunning?

> +
> + BUG_ON(!atomic_read(&sinkref->sink->refs));
> + _pi_sink_get(sinkref);
> +
> + iter->update = 1;
> + iter->sinkref = sinkref;
> + iter->sink = sinkref->sink;
> + }
> + }
> +
> + spin_unlock(&node->lock);
> +
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
> + struct pi_sink *sink;
> +
> + sinkref = iter->sinkref;
> + sink = iter->sink;
> +
> + spin_lock(&sinkref->lock);
> +
> + switch (sinkref->state) {
> + case pi_state_boost:
> + sinkref->state = pi_state_boosted;
> + /* Fall through */
> + case pi_state_boosted:
> + sink->ops->boost(sink, &sinkref->src, lflags);
> + break;
> + case pi_state_deboost:
> + sink->ops->deboost(sink, &sinkref->src, lflags);
> + sinkref->state = pi_state_free;
> +
> + /*
> + * drop the ref that we took when the sinkref
> + * was allocated. We still hold a ref from
> + * above.
> + */
> + _pi_sink_put_all(node, sinkref);
> + break;
> + case pi_state_free:
> + iter->update = 0;
> + break;
> + default:
> + panic("illegal sinkref type: %d", sinkref->state);
> + }
> +
> + spin_unlock(&sinkref->lock);
> +
> + /*
> + * We will drop the sinkref reference while still holding the
> + * preempt/irqs off so that the memory is returned synchronously
> + * to the system.
> + */
> + _pi_sink_put_local(node, sinkref);
> + }
> +
> + local_irq_restore(iflags);

Yack! You keep interrupts off while doing the chain. I think my main
contribution to the PI system 2 years ago was to do this preemptively.
I.e. there were points in the loop where interrupts and preemption
were turned on.

Remember: it goes into user space again. An evil user could craft an
application with a very long lock depth and keep higher-priority
real-time tasks from running for an arbitrarily long time (if no limit
on the lock depth is set, which is bad because it will be too low in
some cases).

But as I said, I have had no time to watch what has actually been going
on in the kernel for roughly the last 2 years. The said defects might
have crept in via other contributors already :-(

Esben

2008-08-22 13:17:36

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Esben,
Thank you for the review. Comments inline.

Esben Nielsen wrote:
> Disclaimer: I am no longer actively involved and I must admit I might
> have lost out on much of
> what have been going on since I contributed to the PI system 2 years
> ago. But I allow myself to comment
> anyway.
>
> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
>
>> The kernel currently addresses priority-inversion through priority-
>> inheritence. However, all of the priority-inheritence logic is
>> integrated into the Real-Time Mutex infrastructure. This causes a few
>> problems:
>>
>> 1) This tightly coupled relationship makes it difficult to extend to
>> other areas of the kernel (for instance, pi-aware wait-queues may
>> be desirable).
>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>> there is no seperation between the locking code, and the pi-code.
>>
>> This patch aims to rectify these shortcomings by designing a stand-alone
>> pi framework which can then be used to replace the rtmutex-specific
>> version. The goal of this framework is to provide similar functionality
>> to the existing subsystem, but with sole focus on PI and the
>> relationships between objects that can boost priority, and the objects
>> that get boosted.
>>
>
> This is really a good idea. When I had time (2 years ago) to actively
> work on these problem
> I also came to the conclusion that PI should be more general than just
> the rtmutex. Preemptive RCU
> was the example which drove it.
>
> But I do disagree that general objects should get boosted: The end
> targets are always tasks. The objects might
> be boosted as intermediate steps, but priority end the only applies to tasks.
>
Actually, I fully agree with you here. It's probably just poor wording on
my part, but this is exactly what happens. We may "boost" arbitrary
objects on the way to boosting a task...but the intermediate objects are
just there to help find our way to the proper tasks. Ultimately
everything ends up at the scheduler eventually ;)
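
Purely as a hypothetical illustration of that point (neither "task_pi_leaf"
nor its ops exist in this series; the real task-side integration lives in
the later patches), a task leaf-sink could look roughly like this:

#include <linux/sched.h>
#include <linux/pi.h>

struct task_pi_leaf {
        struct pi_sink          sink;
        struct task_struct      *task;
        int                     prio;   /* latest boosted priority */
};

static int task_leaf_boost(struct pi_sink *sink, struct pi_source *src,
                           unsigned int flags)
{
        struct task_pi_leaf *leaf = container_of(sink, struct task_pi_leaf, sink);

        leaf->prio = *src->prio;        /* just record the referenced value */
        return 0;
}

static int task_leaf_update(struct pi_sink *sink, unsigned int flags)
{
        /*
         * container_of(sink, struct task_pi_leaf, sink) gives the leaf;
         * this is where the scheduler would be asked to apply leaf->prio
         * to leaf->task and reschedule it if necessary.
         */
        return 0;
}

static struct pi_sink_ops task_leaf_ops = {
        .boost  = task_leaf_boost,
        .update = task_leaf_update,
};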

> I also have a few comments to the actual design:
>
>
>> ....
>> +
>> +Multiple sinks per Node:
>> +
>> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
>> +
>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
>>
>
> This is bad from a RT point of view: You have a hard time determininig
> the number of sinks per node. An rw-lock could have an arbitrary
> number of readers (is supposed to really). Therefore
> you have no chance of knowing how long the boost/deboost operation
> will take. And you also know for how long the boosted tasks stay
> boosted. If there can be an arbitrary number of
> such tasks you can no longer be deterministic.
>

While you may have a valid concern about what rwlocks can do to
determinism, note that we already had PI-enabled rwlocks before my
patch, so I am neither increasing nor decreasing determinism in this
regard. That being said, Steven Rostedt (author of the pi-rwlocks,
CC'd) has facilities to manage this (such as limiting the number of
readers to num_online_cpus) which this design would retain. Long story
short, I do not believe I have made anything worse here, so this is a
different discussion if you are still concerned.


>
>> ...
>> +
>> +#define MAX_PI_DEPENDENCIES 5
>>
>
>
> WHAT??? There is a finite lock depth defined. I know we did that
> originally but it wasn't hardcoded (as far as I remember) and
> it was certainly not as low as 5.
>

Note that this is simply in reference to how many direct sinks you can
link to a node, not how long the resulting chain can grow. The chain
depth is actually completely unconstrained by the design. I chose "5"
here because typically we need 1 sink for the next link in the chain,
and 1 sink for local notifications. The other 3 are there for head-room
(we often hit 3-4 as we transition between nodes: add one node -> delete
another, etc.).

You are not the first to comment about this, however, so it makes me
realize it is not very clear ;) I will comment the code better.
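
To make the distinction concrete (purely a sketch with invented names,
none of this is in the patches): per-node sinks are bounded by
MAX_PI_DEPENDENCIES, but chain depth is not, because every hop is a
separate node with its own sinkref pool.

#include <linux/pi.h>

static struct pi_node chain[16];        /* a 16-deep chain is perfectly legal */

static void build_chain_sketch(void)
{
        int i;

        for (i = 0; i < 16; i++)
                pi_node_init(&chain[i]);

        /*
         * Each node gets exactly ONE downstream sink here, well inside
         * the per-node limit of 5, yet the resulting chain is 16 nodes
         * deep.  The depth limit, if any, comes from elsewhere.
         */
        for (i = 0; i < 15; i++)
                pi_add_sink(&chain[i], &chain[i + 1].sink,
                            PI_FLAG_DEFER_UPDATE);
}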


> Remember: PI is used by the user space futeces as well!
>

Yes, and on a slight tangent from your point, this incidentally is
actually a problem in the design such that I need to respin at least a
v5. My current design uses recursion against the sink->update()
methods, which Peter Zijlstra pointed out would blow up with large
userspace chains. My next version will forgo the recursion in favor of
an iterative method more reminiscent of the original design.

>
>> ....
>> +/*
>> + * _pi_node_update - update the chain
>> + *
>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>> + * that need to propagate up the chain. This is a step-wise process where we
>> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
>> + * times, we guarantee that this update routine is an effective barrier...
>> + * all modifications made prior to the call to this barrier will have completed.
>> + *
>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>> + * form a graph of arbitrary structure. While the graph should technically
>> + * never close on itself barring any bugs, we still want to protect against
>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>> + * from detecting this potential). To do this, we employ a dual-locking
>> + * scheme where we can carefully control the order. That is: node->lock
>> + * protects most of the node's internal state, but it will never be held
>> + * across a chain update. sinkref->lock, on the other hand, can be held
>> + * across a boost/deboost, and also guarantees proper execution order. Also
>> + * note that no locks are held across an sink->update.
>> + */
>> +static int
>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>> +{
>> + struct pi_node *node = node_of(sink);
>> + struct pi_sinkref *sinkref;
>> + unsigned long iflags;
>> + int count = 0;
>> + int i;
>> + int pprio;
>> + struct updater updaters[MAX_PI_DEPENDENCIES];
>> +
>> + spin_lock_irqsave(&node->lock, iflags);
>> +
>> + pprio = node->prio;
>> +
>> + if (!plist_head_empty(&node->srcs))
>> + node->prio = plist_first(&node->srcs)->prio;
>> + else
>> + node->prio = MAX_PRIO;
>> +
>> + list_for_each_entry(sinkref, &node->sinks, list) {
>> + /*
>> + * If the priority is changing, or if this is a
>> + * BOOST/DEBOOST, we consider this sink "stale"
>> + */
>> + if (pprio != node->prio
>> + || sinkref->state != pi_state_boosted) {
>> + struct updater *iter = &updaters[count++];
>>
>
> What prevents count from overrun?
>

The node->sinks list will never have more than MAX_PI_DEPs in it, by design.

>
>> +
>> + BUG_ON(!atomic_read(&sinkref->sink->refs));
>> + _pi_sink_get(sinkref);
>> +
>> + iter->update = 1;
>> + iter->sinkref = sinkref;
>> + iter->sink = sinkref->sink;
>> + }
>> + }
>> +
>> + spin_unlock(&node->lock);
>> +
>> + for (i = 0; i < count; ++i) {
>> + struct updater *iter = &updaters[i];
>> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
>> + struct pi_sink *sink;
>> +
>> + sinkref = iter->sinkref;
>> + sink = iter->sink;
>> +
>> + spin_lock(&sinkref->lock);
>> +
>> + switch (sinkref->state) {
>> + case pi_state_boost:
>> + sinkref->state = pi_state_boosted;
>> + /* Fall through */
>> + case pi_state_boosted:
>> + sink->ops->boost(sink, &sinkref->src, lflags);
>> + break;
>> + case pi_state_deboost:
>> + sink->ops->deboost(sink, &sinkref->src, lflags);
>> + sinkref->state = pi_state_free;
>> +
>> + /*
>> + * drop the ref that we took when the sinkref
>> + * was allocated. We still hold a ref from
>> + * above.
>> + */
>> + _pi_sink_put_all(node, sinkref);
>> + break;
>> + case pi_state_free:
>> + iter->update = 0;
>> + break;
>> + default:
>> + panic("illegal sinkref type: %d", sinkref->state);
>> + }
>> +
>> + spin_unlock(&sinkref->lock);
>> +
>> + /*
>> + * We will drop the sinkref reference while still holding the
>> + * preempt/irqs off so that the memory is returned synchronously
>> + * to the system.
>> + */
>> + _pi_sink_put_local(node, sinkref);
>> + }
>> +
>> + local_irq_restore(iflags);
>>
>
> Yack! You keep interrupts off while doing the chain.

Actually, not quite. The first pass (with interrupts off) simply sets
the new priority value at each local element (limited to 5, typically
1-2). Short and sweet. It's the "update" that happens next (with
interrupts/preemption enabled) that updates the chain.
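
Schematically it looks something like this (a rough sketch only; the
helper names here are invented for illustration, the real code is the
_pi_node_update() you quoted, and PI_FLAG_DEFER_UPDATE is what defers
the chain walk):

static void node_update_sketch(struct pi_node_sketch *node)
{
        unsigned long flags;

        /* Pass 1: interrupts off, but strictly local and bounded work */
        local_irq_save(flags);
        recompute_local_prio(node);             /* scans <= 5 srcs/sinks */
        boost_deboost_stale_sinks(node, PI_FLAG_DEFER_UPDATE);
        local_irq_restore(flags);

        /*
         * Pass 2: the deferred sink->update() on the next node in the
         * chain runs from here, with interrupts and preemption enabled,
         * and repeats the same bounded local step one hop further on.
         */
        run_deferred_updates(node);
}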


> I think my main
> contribution to the PI system 2 years ago was to do this preemptively.
> I.e. there were points in the loop where interrupts and preemption
> were turned on.
>

I agree this is important, but I think you will see with further review
that this is in fact what I do too.

> Remember: It goes into user space again. An evil user could craft an
> application with a very long lock depth and keep higher-priority
> real-time tasks from running for an arbitrarily long time (if no limit
> on the lock depth is set, which is bad because it will be too low in
> some cases).
>
> But as I said I have had no time to watch what has actually been going
> on in the kernel for roughly the last 2 years. Such defects might
> have crept in via other contributors already :-(
>
> Esben
>

Esben,
Your review and insight are very much appreciated. I will be sure to
address the concerns mentioned above and CC you on the next release.

Thanks again,
-Greg




2008-08-22 13:17:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface



On Fri, 22 Aug 2008, Esben Nielsen wrote:

> Disclaimer: I am no longer actively involved and I must admit I might
> have lost track of much of what has been going on since I contributed
> to the PI system 2 years ago. But I allow myself to comment anyway.

Esben, you are always welcome. You are one of the copyright owners of
rtmutex.c ;-)

>
> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
> > The kernel currently addresses priority-inversion through priority-
> > inheritence. However, all of the priority-inheritence logic is
> > integrated into the Real-Time Mutex infrastructure. This causes a few
> > problems:
> >
> > 1) This tightly coupled relationship makes it difficult to extend to
> > other areas of the kernel (for instance, pi-aware wait-queues may
> > be desirable).
> > 2) Enhancing the rtmutex infrastructure becomes challenging because
> > there is no seperation between the locking code, and the pi-code.
> >
> > This patch aims to rectify these shortcomings by designing a stand-alone
> > pi framework which can then be used to replace the rtmutex-specific
> > version. The goal of this framework is to provide similar functionality
> > to the existing subsystem, but with sole focus on PI and the
> > relationships between objects that can boost priority, and the objects
> > that get boosted.
>
> This is really a good idea. When I had time (2 years ago) to actively
> work on these problems I also came to the conclusion that PI should be
> more general than just the rtmutex. Preemptive RCU was the example
> which drove it.
>
> But I do disagree that general objects should get boosted: the end
> targets are always tasks. The objects might be boosted as intermediate
> steps, but priority in the end only applies to tasks.
>
> I also have a few comments to the actual design:
>
> > ....
> > +
> > +Multiple sinks per Node:
> > +
> > +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
> > +
> > +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
>
> This is bad from an RT point of view: you have a hard time determining
> the number of sinks per node. An rw-lock could have an arbitrary
> number of readers (and is supposed to, really). Therefore you have no
> chance of knowing how long the boost/deboost operation will take. And
> you also don't know for how long the boosted tasks stay boosted. If
> there can be an arbitrary number of such tasks you can no longer be
> deterministic.
>
> > ...
> > +
> > +#define MAX_PI_DEPENDENCIES 5
>
>
> WHAT??? There is a finite lock depth defined. I know we did that
> originally but it wasn't hardcoded (as far as I remember) and
> it was certainly not as low as 5.

Yeah, I believe our number is 1024; it is not hardcoded, and it is there
to detect recursive locks.
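
For reference, that check lives in the chain walk in kernel/rtmutex.c;
paraphrased from memory (not an exact quote, and I believe the limit is
also exposed as a sysctl):

/* Max number of hops we will walk up the boosting chain: */
int max_lock_depth = 1024;

        /* ... inside rt_mutex_adjust_prio_chain()'s walk ... */
        if (++depth > max_lock_depth) {
                /* warn about the task that hit the limit, then give up */
                return deadlock_detect ? -EDEADLK : 0;
        }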

>
> Remember: PI is used by user-space futexes as well!

I haven't looked too hard at this code yet, but this may only be
kernel-related for multiple owners (see my explanation below).

>
> > ....
> > +/*
> > + * _pi_node_update - update the chain
> > + *
> > + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> > + * that need to propagate up the chain. This is a step-wise process where we
> > + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
> > + * times, we guarantee that this update routine is an effective barrier...
> > + * all modifications made prior to the call to this barrier will have completed.
> > + *
> > + * Deadlock avoidance: This node may participate in a chain of nodes which
> > + * form a graph of arbitrary structure. While the graph should technically
> > + * never close on itself barring any bugs, we still want to protect against
> > + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> > + * from detecting this potential). To do this, we employ a dual-locking
> > + * scheme where we can carefully control the order. That is: node->lock
> > + * protects most of the node's internal state, but it will never be held
> > + * across a chain update. sinkref->lock, on the other hand, can be held
> > + * across a boost/deboost, and also guarantees proper execution order. Also
> > + * note that no locks are held across an sink->update.
> > + */
> > +static int
> > +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> > +{
> > + struct pi_node *node = node_of(sink);
> > + struct pi_sinkref *sinkref;
> > + unsigned long iflags;
> > + int count = 0;
> > + int i;
> > + int pprio;
> > + struct updater updaters[MAX_PI_DEPENDENCIES];
> > +
> > + spin_lock_irqsave(&node->lock, iflags);
> > +
> > + pprio = node->prio;
> > +
> > + if (!plist_head_empty(&node->srcs))
> > + node->prio = plist_first(&node->srcs)->prio;
> > + else
> > + node->prio = MAX_PRIO;
> > +
> > + list_for_each_entry(sinkref, &node->sinks, list) {
> > + /*
> > + * If the priority is changing, or if this is a
> > + * BOOST/DEBOOST, we consider this sink "stale"
> > + */
> > + if (pprio != node->prio
> > + || sinkref->state != pi_state_boosted) {
> > + struct updater *iter = &updaters[count++];
>
> What prevents count from overrun?
>
> > +
> > + BUG_ON(!atomic_read(&sinkref->sink->refs));
> > + _pi_sink_get(sinkref);
> > +
> > + iter->update = 1;
> > + iter->sinkref = sinkref;
> > + iter->sink = sinkref->sink;
> > + }
> > + }
> > +
> > + spin_unlock(&node->lock);
> > +
> > + for (i = 0; i < count; ++i) {
> > + struct updater *iter = &updaters[i];
> > + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
> > + struct pi_sink *sink;
> > +
> > + sinkref = iter->sinkref;
> > + sink = iter->sink;
> > +
> > + spin_lock(&sinkref->lock);
> > +
> > + switch (sinkref->state) {
> > + case pi_state_boost:
> > + sinkref->state = pi_state_boosted;
> > + /* Fall through */
> > + case pi_state_boosted:
> > + sink->ops->boost(sink, &sinkref->src, lflags);
> > + break;
> > + case pi_state_deboost:
> > + sink->ops->deboost(sink, &sinkref->src, lflags);
> > + sinkref->state = pi_state_free;
> > +
> > + /*
> > + * drop the ref that we took when the sinkref
> > + * was allocated. We still hold a ref from
> > + * above.
> > + */
> > + _pi_sink_put_all(node, sinkref);
> > + break;
> > + case pi_state_free:
> > + iter->update = 0;
> > + break;
> > + default:
> > + panic("illegal sinkref type: %d", sinkref->state);
> > + }
> > +
> > + spin_unlock(&sinkref->lock);
> > +
> > + /*
> > + * We will drop the sinkref reference while still holding the
> > + * preempt/irqs off so that the memory is returned synchronously
> > + * to the system.
> > + */
> > + _pi_sink_put_local(node, sinkref);
> > + }
> > +
> > + local_irq_restore(iflags);
>
> Yack! You keep interrupts off while doing the chain. I think my main
> contribution to the PI system 2 years ago was to do this preemptively.
> I.e. there were points in the loop where interrupts and preemption
> were turned on.
>
> Remember: It goes into user space again. An evil user could craft an
> application with a very long lock depth and keep higher-priority
> real-time tasks from running for an arbitrarily long time (if no limit
> on the lock depth is set, which is bad because it will be too low in
> some cases).
>
> But as I said I have had no time to watch what has actually been going
> on in the kernel for roughly the last 2 years. Such defects might
> have crept in via other contributors already :-(

The rtmutex.c has hardly changed since you last left it. The two big
additions were adaptive locks, which hardly touched the pi chain, and my
rwlocks allowing multiple readers. The latter added a hook to allow going
into the pi chain for all readers while holding a spinlock and, yes, irqs
off. The difference is that it is a bug to hold an rwlock (internal
kernel lock only) and take a futex. Thus, this rwlock code did have a
recursive depth of 5. Perhaps that's where the PI depth above comes from?

I still haven't had the time to analyze Gregory's code, so the points
you made may only relate to kernel activities (like the new rwlock
code). But generalizing it decouples the PI from the locking, which in
general is a good thing; for the multiple-reader locks, though, it is
dangerous to decouple, since there are a lot of assumptions shared
between the multiple-PI-owner code and the rwlocks. In a general
approach, those assumptions will be harder to see.

-- Steve

2008-08-22 16:10:41

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Gregory Haskins wrote:
> Hi Esben,
> Thank you for the review. Comments inline.
>
> Esben Nielsen wrote:
>
>> Disclaimer: I am no longer actively involved and I must admit I might
>> have lost track of much of what has been going on since I contributed
>> to the PI system 2 years ago. But I allow myself to comment anyway.
>>
>> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
>>
>>
>>> The kernel currently addresses priority-inversion through priority-
>>> inheritence. However, all of the priority-inheritence logic is
>>> integrated into the Real-Time Mutex infrastructure. This causes a few
>>> problems:
>>>
>>> 1) This tightly coupled relationship makes it difficult to extend to
>>> other areas of the kernel (for instance, pi-aware wait-queues may
>>> be desirable).
>>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>>> there is no seperation between the locking code, and the pi-code.
>>>
>>> This patch aims to rectify these shortcomings by designing a stand-alone
>>> pi framework which can then be used to replace the rtmutex-specific
>>> version. The goal of this framework is to provide similar functionality
>>> to the existing subsystem, but with sole focus on PI and the
>>> relationships between objects that can boost priority, and the objects
>>> that get boosted.
>>>
>>>
>> This is really a good idea. When I had time (2 years ago) to actively
>> work on these problems I also came to the conclusion that PI should be
>> more general than just the rtmutex. Preemptive RCU was the example
>> which drove it.
>>
>> But I do disagree that general objects should get boosted: the end
>> targets are always tasks. The objects might be boosted as intermediate
>> steps, but priority in the end only applies to tasks.
>>
>>
> Actually I fully agree with you here. It's probably just poor wording on
> my part, but this is exactly what happens. We may "boost" arbitrary
> objects on the way to boosting a task... but the intermediate objects are
> just there to help find our way to the proper tasks. Ultimately
> everything ends up at the scheduler ;)
>
>
>> I also have a few comments to the actual design:
>>
>>
>>
>>> ....
>>> +
>>> +Multiple sinks per Node:
>>> +
>>> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
>>> +
>>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
>>>
>>>
>> This is bad from an RT point of view: you have a hard time determining
>> the number of sinks per node. An rw-lock could have an arbitrary
>> number of readers (and is supposed to, really). Therefore you have no
>> chance of knowing how long the boost/deboost operation will take. And
>> you also don't know for how long the boosted tasks stay boosted. If
>> there can be an arbitrary number of such tasks you can no longer be
>> deterministic.
>>
>>
>
> While you may have a valid concern about what rwlocks can do to
> determinism, note that we already had PI-enabled rwlocks before my
> patch, so I am neither increasing nor decreasing determinism in this
> regard. That being said, Steven Rostedt (author of the pi-rwlocks,
> CC'd) has facilities to manage this (such as limiting the number of
> readers to num_online_cpus) which this design would retain. Long story
> short, I do not believe I have made anything worse here, so this is a
> different discussion if you are still concerned.
>
>
>
>>
>>
>>> ...
>>> +
>>> +#define MAX_PI_DEPENDENCIES 5
>>>
>>>
>> WHAT??? There is a finite lock depth defined. I know we did that
>> originally but it wasn't hardcoded (as far as I remember) and
>> it was certainly not as low as 5.
>>
>>
>
> Note that this is simply in reference to how many direct sinks you can
> link to a node, not how long the resulting chain can grow. The chain
> depth is actually completely unconstrained by the design. I chose "5"
> here because typically we need 1 sink for the next link in the chain,
> and 1 sink for local notifications. The other 3 are there for head-room
> (we often hit 3-4 as we transition between nodes: add one node, delete
> another, etc.).
>

To clarify what I meant here: if you think of a normal linked-list node
having a single "next" pointer, this implementation is like each node
having up to 5 "next" pointers. However, typically only 1-2 are used,
and all but one will usually point to a "leaf" node, meaning it does not
form a chain but terminates processing locally. Typically there will be
only one link to something that forms a chain with other nodes. I did
this because I realized the pattern (boost/deboost/update) was similar
whether the node was a leaf or a chain-link, so I unified both behind
the single pi_sink interface.

That being understood, note that, as with any linked list, the nodes can
still form a chain of arbitrary depth (and I will fix the update to be
iterative instead of recursive, as previously mentioned).
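
Roughly, the interface both kinds of sink implement looks like this
(reconstructed from the fragments quoted in this thread; the exact
declarations in the patch may differ slightly):

struct pi_sink_ops_sketch {
        /* a source's priority contribution changed for this sink */
        int (*boost)(struct pi_sink *sink, struct pi_source *src,
                     unsigned int flags);
        int (*deboost)(struct pi_sink *sink, struct pi_source *src,
                       unsigned int flags);
        /* propagate any deferred changes: a chain-link walks onward to
         * the next node here, a leaf applies the new priority locally
         * and stops */
        int (*update)(struct pi_sink *sink, unsigned int flags);
};

A task leaf-sink implements boost/deboost by adjusting the task's
priority (rescheduling if necessary), while a chain-link sink forwards
them toward the next node.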

> You are not the first to comment about this, however, so it makes me
> realize it is not very clear ;) I will comment the code better.
>
>
>
>> Remember: PI is used by user-space futexes as well!
>>
>>
>
> Yes, and on a slight tangent from your point, this incidentally is
> actually a problem in the design, one that means I need to respin at
> least a v5. My current design uses recursion against the sink->update()
> methods, which Peter Zijlstra pointed out would blow up with large
> userspace chains. My next version will forgo the recursion in favor of
> an iterative method more reminiscent of the original design.
>
>
>>
>>
>>> ....
>>> +/*
>>> + * _pi_node_update - update the chain
>>> + *
>>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>>> + * that need to propagate up the chain. This is a step-wise process where we
>>> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
>>> + * times, we guarantee that this update routine is an effective barrier...
>>> + * all modifications made prior to the call to this barrier will have completed.
>>> + *
>>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>>> + * form a graph of arbitrary structure. While the graph should technically
>>> + * never close on itself barring any bugs, we still want to protect against
>>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>>> + * from detecting this potential). To do this, we employ a dual-locking
>>> + * scheme where we can carefully control the order. That is: node->lock
>>> + * protects most of the node's internal state, but it will never be held
>>> + * across a chain update. sinkref->lock, on the other hand, can be held
>>> + * across a boost/deboost, and also guarantees proper execution order. Also
>>> + * note that no locks are held across an sink->update.
>>> + */
>>> +static int
>>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>>> +{
>>> + struct pi_node *node = node_of(sink);
>>> + struct pi_sinkref *sinkref;
>>> + unsigned long iflags;
>>> + int count = 0;
>>> + int i;
>>> + int pprio;
>>> + struct updater updaters[MAX_PI_DEPENDENCIES];
>>> +
>>> + spin_lock_irqsave(&node->lock, iflags);
>>> +
>>> + pprio = node->prio;
>>> +
>>> + if (!plist_head_empty(&node->srcs))
>>> + node->prio = plist_first(&node->srcs)->prio;
>>> + else
>>> + node->prio = MAX_PRIO;
>>> +
>>> + list_for_each_entry(sinkref, &node->sinks, list) {
>>> + /*
>>> + * If the priority is changing, or if this is a
>>> + * BOOST/DEBOOST, we consider this sink "stale"
>>> + */
>>> + if (pprio != node->prio
>>> + || sinkref->state != pi_state_boosted) {
>>> + struct updater *iter = &updaters[count++];
>>>
>>>
>> What prevents count from overrun?
>>
>>
>
> The node->sinks list will never hold more than MAX_PI_DEPENDENCIES entries, by design.
>
>
>>
>>
>>> +
>>> + BUG_ON(!atomic_read(&sinkref->sink->refs));
>>> + _pi_sink_get(sinkref);
>>> +
>>> + iter->update = 1;
>>> + iter->sinkref = sinkref;
>>> + iter->sink = sinkref->sink;
>>> + }
>>> + }
>>> +
>>> + spin_unlock(&node->lock);
>>> +
>>> + for (i = 0; i < count; ++i) {
>>> + struct updater *iter = &updaters[i];
>>> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
>>> + struct pi_sink *sink;
>>> +
>>> + sinkref = iter->sinkref;
>>> + sink = iter->sink;
>>> +
>>> + spin_lock(&sinkref->lock);
>>> +
>>> + switch (sinkref->state) {
>>> + case pi_state_boost:
>>> + sinkref->state = pi_state_boosted;
>>> + /* Fall through */
>>> + case pi_state_boosted:
>>> + sink->ops->boost(sink, &sinkref->src, lflags);
>>> + break;
>>> + case pi_state_deboost:
>>> + sink->ops->deboost(sink, &sinkref->src, lflags);
>>> + sinkref->state = pi_state_free;
>>> +
>>> + /*
>>> + * drop the ref that we took when the sinkref
>>> + * was allocated. We still hold a ref from
>>> + * above.
>>> + */
>>> + _pi_sink_put_all(node, sinkref);
>>> + break;
>>> + case pi_state_free:
>>> + iter->update = 0;
>>> + break;
>>> + default:
>>> + panic("illegal sinkref type: %d", sinkref->state);
>>> + }
>>> +
>>> + spin_unlock(&sinkref->lock);
>>> +
>>> + /*
>>> + * We will drop the sinkref reference while still holding the
>>> + * preempt/irqs off so that the memory is returned synchronously
>>> + * to the system.
>>> + */
>>> + _pi_sink_put_local(node, sinkref);
>>> + }
>>> +
>>> + local_irq_restore(iflags);
>>>
>>>
>> Yack! You keep interrupts off while doing the chain.
>>
>
> Actually, not quite. The first pass (with interrupts off) simply sets
> the new priority value at each local element (limited to 5, typically
> 1-2). Short and sweet. It's the "update" that happens next (with
> interrupts/preemption enabled) that updates the chain.
>
>
>
>> I think my main
>> contribution to the PI system 2 years ago was to do this preemptively.
>> I.e. there were points in the loop where interrupts and preemption
>> were turned on.
>>
>>
>
> I agree this is important, but I think you will see with further review
> that this is in fact what I do too.
>
>
>> Remember: It goes into user space again. An evil user could craft an
>> application with a very long lock depth and keep higher-priority
>> real-time tasks from running for an arbitrarily long time (if no limit
>> on the lock depth is set, which is bad because it will be too low in
>> some cases).
>>
>> But as I said I have had no time to watch what has actually been going
>> on in the kernel for roughly the last 2 years. Such defects might
>> have crept in via other contributors already :-(
>>
>> Esben
>>
>>
>
> Esben,
> Your review and insight are very much appreciated. I will be sure to
> address the concerns mentioned above and CC you on the next release.
>
> Thanks again,
> -Greg
>
>
>


