2008-02-25 16:29:25

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 0/9] adaptive real-time locks

You can download this series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v2.tar.bz2

Changes since v1:

*) Rebased from 24-rt1 to 24.2-rt2
*) Dropped controversial (and likely unnecessary) printk patch
*) Dropped (internally) controversial PREEMPT_SPINLOCK_WAITERS config option
*) Incorporated review feedback for comment/config cleanup from Pavel/PeterZ
*) Moved lateral-steal to front of queue
*) Fixed compilation issue with !defined(LATERAL_STEAL)
*) Moved spinlock rework into a separate series:
ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2

Todo:
*) Convert loop based timeouts to use nanoseconds
*) Tie into lockstat infrastructure.
*) Long-term: research adaptive-timeout algorithms so a fixed/one-size-fits-all
value is not necessary.

------------------------

Adaptive real-time locks

The real-time patches to the Linux kernel convert the architecture-specific
SMP synchronization primitives commonly referred to as "spinlocks" to an
"RT mutex" implementation that supports a priority inheritance protocol
and priority-ordered wait queues. The RT mutex implementation allows tasks
that would otherwise busy-wait for a contended lock to be preempted by
higher-priority tasks without compromising the integrity of critical
sections protected by the lock. The unintended side effect is that the
-rt kernel suffers a significant degradation of I/O throughput (disk and
net) due to the extra overhead of managing pi-lists and context switching.
This has generally been accepted as the price to pay for low-latency
preemption.

Our research indicates that it does not necessarily have to be this
way. This patch series introduces an adaptive technique that retains both
the priority inheritance protocol and the preemptive nature of spinlocks
and mutexes, while adding a 300%+ throughput increase to the Linux
real-time kernel. It applies to 2.6.24.2-rt2.

These performance increases apply to disk I/O as well as netperf UDP
benchmarks, without compromising RT preemption latency. For more complex
applications, overall I/O throughput seems to approach that of a
PREEMPT_VOLUNTARY or PREEMPT_DESKTOP kernel, as shipped by most distros.

Essentially, the RT mutex has been modified to busy-wait under
contention for a limited (and configurable) time. This works because
most locks are typically held for very short time spans; quite often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context-switch overhead by up to 90%, saving CPU cycles and
eliminating massive hot spots in the scheduler and other bottlenecks in
the kernel, even though we busy-wait (burning CPU cycles) to poll the
lock.
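
Conceptually, the waiter-side decision boils down to the sketch below
(simplified only; the real code is introduced in patch 5/9, and
lock_granted(), owner_changed() and owner_running() are placeholders for
the actual checks):

/*
 * Simplified sketch only -- spin while it is likely to pay off, and
 * sleep otherwise.  The real implementation is adaptive_wait() in
 * kernel/rtmutex_adaptive.h (patch 5/9).
 */
static int should_sleep(struct rt_mutex *lock)
{
        for (;;) {
                if (lock_granted(lock))         /* we were handed the lock */
                        return 0;
                if (owner_changed(lock))        /* re-sample the owner and retry */
                        return 0;
                if (!owner_running(lock))       /* owner is off-cpu: spinning is futile */
                        return 1;
                cpu_relax();                    /* owner is running: release is imminent */
        }
}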

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel. The machine is a 4-way (dual-core,
dual-socket) 2GHz 5130 Xeon (core2duo-woodcrest) Dell Precision 490.

Some tests show a marked improvement (for instance, ~450% more throughput
for dbench, and ~500% faster for hackbench), whereas for others
(make -j 128) the results were not as pronounced, but they were still
net-positive. In all cases we have also verified, using cyclictest, that
deterministic latency is not impacted.

This patch series depends on some rework of the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks. We found that
the increased pressure on lock->wait_lock could cause rare but
serious latency spikes, which are fixed by a FIFO raw_spinlock_t
implementation. Nick was gracious enough to allow us to re-use his
work (which has already been accepted for 2.6.25). Note that we also have
a C version of his protocol available if other architectures need
FIFO-lock support as well, which we will gladly post upon request.

You can find this re-work as a separate series here:

ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2

Special thanks go to the many people who were instrumental to this project,
including:
*) the -rt team here at Novell for research, development, and testing.
*) Nick Piggin for his invaluable consultation/feedback and the use of his
x86-ticket-locks.
*) the reviewers/testers at SUSE and MontaVista, and Bill Huey, for their
time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg


2008-02-25 16:29:43

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to include lateral priority

The current logic only allows lock stealing to occur if the current task
is of higher priority than the pending owner. We can gain significant
throughput improvements (200%+) by allowing the lock-stealing code to
include tasks of equal priority. The theory is that the system will make
faster progress by allowing the task already on the CPU to take the lock
rather than waiting for the system to wake up a different task.

This does add a degree of unfairness, yes. But also note that the users
of these locks in non-rt environments have already been using unfair
raw spinlocks anyway, so the tradeoff is probably worth it.

The way I like to think of this is that higher-priority tasks should
clearly preempt, and lower-priority tasks should clearly block. However,
if tasks have an identical priority value, then we can think of the
scheduler's decisions as the tie-breaking parameter. (e.g. tasks that the
scheduler picked to run first have a logically higher priority among tasks
of the same prio). This helps to keep the system "primed" with tasks doing
useful work, and the end result is higher throughput.
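
Distilled into a standalone predicate, the rule looks like the sketch below
(illustrative only, with made-up names; the actual change is the
lock_is_stealable() helper in the patch, and recall that a smaller ->prio
value means higher priority):

/* Illustrative distillation; not the patch itself. */
static int may_steal(int my_prio, int pendowner_prio, int lateral_ok)
{
        if (my_prio < pendowner_prio)   /* strictly higher priority */
                return 1;               /* ... always allowed to steal */
        if (my_prio == pendowner_prio)  /* equal priority */
                return lateral_ok;      /* ... only on the lateral-steal path */
        return 0;                       /* lower priority: never steal */
}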

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/Kconfig.preempt | 10 ++++++++++
kernel/rtmutex.c | 31 +++++++++++++++++++++++--------
2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..e493257 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -196,3 +196,13 @@ config SPINLOCK_BKL
Say Y here if you are building a kernel for a desktop system.
Say N if you are unsure.

+config RTLOCK_LATERAL_STEAL
+ bool "Allow equal-priority rtlock stealing"
+ default y
+ depends on PREEMPT_RT
+ help
+ This option alters the rtlock lock-stealing logic to allow
+ equal priority tasks to preempt a pending owner in addition
+ to higher priority tasks. This allows for a significant
+ boost in throughput under certain circumstances at the expense
+ of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..6624c66 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -313,12 +313,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
return ret;
}

+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+ if (current->prio >= pendowner->prio)
+#else
+ if (current->prio > pendowner->prio)
+ return 0;
+
+ if (!unfair && (current->prio == pendowner->prio))
+#endif
+ return 0;
+
+ return 1;
+}
+
/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
*/
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -330,7 +345,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
return 1;

spin_lock(&pendowner->pi_lock);
- if (current->prio >= pendowner->prio) {
+ if (!lock_is_stealable(pendowner, unfair)) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
@@ -383,7 +398,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
*
* Must be called with lock->wait_lock held.
*/
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
{
/*
* We have to be careful here if the atomic speedups are
@@ -406,7 +421,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
mark_rt_mutex_waiters(lock);

- if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+ if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
return 0;

/* We got the lock. */
@@ -707,7 +722,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
int saved_lock_depth = current->lock_depth;

/* Try to acquire the lock */
- if (try_to_take_rt_mutex(lock))
+ if (try_to_take_rt_mutex(lock, 1))
break;
/*
* waiter.task is NULL the first time we come here and
@@ -947,7 +962,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
init_lists(lock);

/* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
+ if (try_to_take_rt_mutex(lock, 0)) {
spin_unlock_irqrestore(&lock->wait_lock, flags);
return 0;
}
@@ -970,7 +985,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long saved_flags;

/* Try to acquire the lock: */
- if (try_to_take_rt_mutex(lock))
+ if (try_to_take_rt_mutex(lock, 0))
break;

/*
@@ -1078,7 +1093,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lock)

init_lists(lock);

- ret = try_to_take_rt_mutex(lock);
+ ret = try_to_take_rt_mutex(lock, 0);
/*
* try_to_take_rt_mutex() sets the lock waiters
* bit unconditionally. Clean this up.

2008-02-25 16:29:58

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

From: Sven-Thorsten Dietrich <[email protected]>

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich <[email protected]>
---

kernel/rtmutex.c | 7 ++++++-
kernel/sysctl.c | 14 ++++++++++++++
2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 6624c66..cd39c26 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -18,6 +18,10 @@

#include "rtmutex_common.h"

+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
+
/*
* lock->owner state tracking:
*
@@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;

- if (!unfair && (current->prio == pendowner->prio))
+ if (unlikely(current->prio == pendowner->prio) &&
+ !(unfair && rtmutex_lateral_steal))
#endif
return 0;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c913d48..c24c53d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
int sysctl_legacy_va_layout;
#endif

+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
extern int prove_locking;
extern int lock_stat;

@@ -836,6 +840,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rtmutex_lateral_steal",
+ .data = &rtmutex_lateral_steal,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_PROC_FS
{
.ctl_name = CTL_UNNUMBERED,

2008-02-25 16:30:28

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

The current logic makes rather coarse adjustments to current->state since
it is planning on sleeping anyway. We want to eventually move to an
adaptive (i.e. optional-sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule(). This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch to keep the changes small and easier
to review.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 20 +++++++++++++++-----
1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cd39c26..ef52db6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -681,6 +681,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
}

+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+ unsigned long state = xchg(&current->state, new_state);
+ if (unlikely(state == TASK_RUNNING))
+ *saved_state = TASK_RUNNING;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* saved_state accordingly. If we did not get a real wakeup
* then we return with the saved state.
*/
- saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+ saved_state = current->state;
+ smp_mb();

for (;;) {
unsigned long saved_flags;
@@ -752,14 +761,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)

debug_rt_mutex_print_deadlock(&waiter);

- schedule_rt_mutex(lock);
+ update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+ if (waiter.task)
+ schedule_rt_mutex(lock);
+ else
+ update_current(TASK_RUNNING_MUTEX, &saved_state);

spin_lock_irqsave(&lock->wait_lock, flags);
current->flags |= saved_flags;
current->lock_depth = saved_lock_depth;
- state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
- if (unlikely(state == TASK_RUNNING))
- saved_state = TASK_RUNNING;
}

state = xchg(&current->state, saved_state);

2008-02-25 16:30:44

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 4/9] optimize rt lock wakeup

It is redundant to wake the grantee task if it is already running.

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins <[email protected]>
Signed-off-by: Peter Morreale <[email protected]>
---

kernel/rtmutex.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ef52db6..bf9e230 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -531,6 +531,41 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
pendowner = waiter->task;
waiter->task = NULL;

+ /*
+ * Do the wakeup before the ownership change to give any spinning
+ * waiter grantees a headstart over the other threads that will
+ * trigger once owner changes.
+ */
+ if (!savestate)
+ wake_up_process(pendowner);
+ else {
+ /*
+ * We can skip the actual (expensive) wakeup if the
+ * waiter is already running, but we have to be careful
+ * of race conditions because they may be about to sleep.
+ *
+ * The waiter-side protocol has the following pattern:
+ * 1: Set state != RUNNING
+ * 2: Conditionally sleep if waiter->task != NULL;
+ *
+ * And the owner-side has the following:
+ * A: Set waiter->task = NULL
+ * B: Conditionally wake if the state != RUNNING
+ *
+ * As long as we ensure 1->2 order, and A->B order, we
+ * will never miss a wakeup.
+ *
+ * Therefore, this barrier ensures that waiter->task = NULL
+ * is visible before we test the pendowner->state. The
+ * corresponding barrier is in the sleep logic.
+ */
+ smp_mb();
+
+ if ((pendowner->state != TASK_RUNNING)
+ && (pendowner->state != TASK_RUNNING_MUTEX))
+ wake_up_process_mutex(pendowner);
+ }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);

spin_unlock(&current->pi_lock);
@@ -557,11 +592,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
}
spin_unlock(&pendowner->pi_lock);
-
- if (savestate)
- wake_up_process_mutex(pendowner);
- else
- wake_up_process(pendowner);
}

/*
@@ -762,6 +792,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(&waiter);

update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+ /*
+ * The xchg() in update_current() is an implicit barrier
+ * which we rely upon to ensure current->state is visible
+ * before we test waiter.task.
+ */
if (waiter.task)
schedule_rt_mutex(lock);
else

2008-02-25 16:31:00

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping). Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short hold locks. However, they also have a
con in that they create indeterminate latencies since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks. This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance since the locks will now sleep under
contention. Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent. Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem. They
spin when possible, and sleep when necessary (to avoid deadlock, etc).
This significantly improves many areas of the performance of the -rt
kernel.

Signed-off-by: Gregory Haskins <[email protected]>
Signed-off-by: Peter Morreale <[email protected]>
Signed-off-by: Sven Dietrich <[email protected]>
---

kernel/Kconfig.preempt | 20 +++++++
kernel/rtmutex.c | 30 +++++++---
kernel/rtmutex_adaptive.h | 138 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e493257..d2432fa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -183,6 +183,26 @@ config RCU_TRACE
Say Y/M here if you want to enable RCU tracing in-kernel/module.
Say N if you are unsure.

+config ADAPTIVE_RTLOCK
+ bool "Adaptive real-time locks"
+ default y
+ depends on PREEMPT_RT && SMP
+ help
+ PREEMPT_RT allows for greater determinism by transparently
+ converting normal spinlock_ts into preemptible rtmutexes which
+ sleep any waiters under contention. However, in many cases the
+ lock will be released in less time than it takes to context
+ switch. Therefore, the "sleep under contention" policy may also
+ degrade throughput performance due to the extra context switches.
+
+ This option alters the rtmutex derived spinlock_t replacement
+ code to use an adaptive spin/sleep algorithm. It will spin
+ unless it determines it must sleep to avoid deadlock. This
+ offers a best of both worlds solution since we achieve both
+ high-throughput and low-latency.
+
+ If unsure, say Y.
+
config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index bf9e230..3802ef8 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,8 @@
* Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[email protected]>
* Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
* Copyright (C) 2006 Esben Nielsen
+ * Copyright (C) 2008 Novell, Inc., Sven Dietrich, Peter Morreale,
+ * and Gregory Haskins
*
* See Documentation/rt-mutex-design.txt for details.
*/
@@ -17,6 +19,7 @@
#include <linux/hardirq.h>

#include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"

#ifdef CONFIG_RTLOCK_LATERAL_STEAL
int rtmutex_lateral_steal __read_mostly = 1;
@@ -734,6 +737,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
{
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+ DECLARE_ADAPTIVE_WAITER(adaptive);

debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -780,6 +784,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}

+ prepare_adaptive_wait(lock, &adaptive);
+
/*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
@@ -791,16 +797,20 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)

debug_rt_mutex_print_deadlock(&waiter);

- update_current(TASK_UNINTERRUPTIBLE, &saved_state);
- /*
- * The xchg() in update_current() is an implicit barrier
- * which we rely upon to ensure current->state is visible
- * before we test waiter.task.
- */
- if (waiter.task)
- schedule_rt_mutex(lock);
- else
- update_current(TASK_RUNNING_MUTEX, &saved_state);
+ /* adaptive_wait() returns 1 if we need to sleep */
+ if (adaptive_wait(lock, &waiter, &adaptive)) {
+ update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+ /*
+ * The xchg() in update_current() is an implicit
+ * barrier which we rely upon to ensure current->state
+ * is visible before we test waiter.task.
+ */
+ if (waiter.task)
+ schedule_rt_mutex(lock);
+ else
+ update_current(TASK_RUNNING_MUTEX,
+ &saved_state);
+ }

spin_lock_irqsave(&lock->wait_lock, flags);
current->flags |= saved_flags;
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
new file mode 100644
index 0000000..862c088
--- /dev/null
+++ b/kernel/rtmutex_adaptive.h
@@ -0,0 +1,138 @@
+/*
+ * Adaptive RT lock support
+ *
+ * There are pros and cons when deciding between the two basic forms of
+ * locking primitives (spinning vs sleeping). Without going into great
+ * detail on either one, we note that spinlocks have the advantage of
+ * lower overhead for short hold locks. However, they also have a
+ * con in that they create indeterminate latencies since preemption
+ * must traditionally be disabled while the lock is held (to prevent deadlock).
+ *
+ * We want to avoid non-deterministic critical sections in -rt. Therefore,
+ * when realtime is enabled, most contexts are converted to threads, and
+ * likewise most spinlock_ts are converted to sleepable rt-mutex derived
+ * locks. This allows the holder of the lock to remain fully preemptible,
+ * thus reducing a major source of latencies in the kernel.
+ *
+ * However, converting what was once a true spinlock into a sleeping lock
+ * may also decrease performance since the locks will now sleep under
+ * contention. Since the fundamental lock used to be a spinlock, it is
+ * highly likely that it was used in a short-hold path and that release
+ * is imminent. Therefore sleeping only serves to cause context-thrashing.
+ *
+ * Adaptive RT locks use a hybrid approach to solve the problem. They
+ * spin when possible, and sleep when necessary (to avoid deadlock, etc).
+ * This significantly improves many areas of the performance of the -rt
+ * kernel.
+ *
+ * Copyright (C) 2008 Novell, Inc.,
+ * Sven Dietrich, Peter Morreale, and Gregory Haskins
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef __KERNEL_RTMUTEX_ADAPTIVE_H
+#define __KERNEL_RTMUTEX_ADAPTIVE_H
+
+#include "rtmutex_common.h"
+
+
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+struct adaptive_waiter {
+ struct task_struct *owner;
+};
+
+/*
+ * Adaptive-rtlocks will busywait when possible, and sleep only if
+ * necessary. Note that the busyloop looks racy, and it is....but we do
+ * not care. If we lose any races it simply means that we spin one more
+ * time before seeing that we need to break-out on the next iteration.
+ *
+ * We realize this is a relatively large function to inline, but note that
+ * it is only instantiated 1 or 2 times max, and it makes a measurable
+ * performance different to avoid the call.
+ *
+ * Returns 1 if we should sleep
+ *
+ */
+static inline int
+adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
+ struct adaptive_waiter *adaptive)
+{
+ int sleep = 0;
+
+ for (;;) {
+ /*
+ * If the task was re-awoken, break out completely so we can
+ * reloop through the lock-acquisition code.
+ */
+ if (!waiter->task)
+ break;
+
+ /*
+ * We need to break if the owner changed so we can reloop
+ * and safely acquire the owner-pointer again with the
+ * wait_lock held.
+ */
+ if (adaptive->owner != rt_mutex_owner(lock))
+ break;
+
+ /*
+ * If we got here, presumably the lock ownership is still
+ * current. We will use it to our advantage to be able to
+ * spin without disabling preemption...
+ */
+
+ /*
+ * .. sleep if the owner is not running..
+ */
+ if (!adaptive->owner->se.on_rq) {
+ sleep = 1;
+ break;
+ }
+
+ /*
+ * .. or is running on our own cpu (to prevent deadlock)
+ */
+ if (task_cpu(adaptive->owner) == task_cpu(current)) {
+ sleep = 1;
+ break;
+ }
+
+ cpu_relax();
+ }
+
+ put_task_struct(adaptive->owner);
+
+ return sleep;
+}
+
+static inline void
+prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
+{
+ /*
+ * We must acquire/lock the owner pointer while holding
+ * the wait_lock, or we risk racing against the owner
+ * exiting.
+ */
+ adaptive->owner = rt_mutex_owner(lock);
+ get_task_struct(adaptive->owner);
+}
+
+#define DECLARE_ADAPTIVE_WAITER(name) \
+ struct adaptive_waiter name = { .owner = NULL, }
+
+#else
+
+#define DECLARE_ADAPTIVE_WAITER(name)
+
+#define adaptive_wait(lock, waiter, busy) 1
+#define prepare_adaptive_wait(lock, busy) {}
+
+#endif /* CONFIG_ADAPTIVE_RTLOCK */
+
+
+#endif /* __KERNEL_RTMUTEX_ADAPTIVE_H */

2008-02-25 16:31:25

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

From: Sven Dietrich <[email protected]>

Signed-off-by: Sven Dietrich <[email protected]>
---

kernel/Kconfig.preempt | 11 +++++++++++
kernel/rtmutex.c | 4 ++++
kernel/rtmutex_adaptive.h | 11 +++++++++--
kernel/sysctl.c | 12 ++++++++++++
4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2432fa..ac1cbad 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -203,6 +203,17 @@ config ADAPTIVE_RTLOCK

If unsure, say Y.

+config RTLOCK_DELAY
+ int "Default delay (in loops) for adaptive rtlocks"
+ range 0 1000000000
+ depends on ADAPTIVE_RTLOCK
+ default "10000"
+ help
+ This allows you to specify the maximum attempts a task will spin
+ attempting to acquire an rtlock before sleeping. The value is
+ tunable at runtime via a sysctl. A setting of 0 (zero) disables
+ the adaptive algorithm entirely.
+
config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 3802ef8..4a16b13 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -25,6 +25,10 @@
int rtmutex_lateral_steal __read_mostly = 1;
#endif

+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
/*
* lock->owner state tracking:
*
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 862c088..60c6328 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -43,6 +43,7 @@
#ifdef CONFIG_ADAPTIVE_RTLOCK
struct adaptive_waiter {
struct task_struct *owner;
+ int timeout;
};

/*
@@ -64,7 +65,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
{
int sleep = 0;

- for (;;) {
+ for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
* If the task was re-awoken, break out completely so we can
* reloop through the lock-acquisition code.
@@ -105,6 +106,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
cpu_relax();
}

+ if (adaptive->timeout <= 0)
+ sleep = 1;
+
put_task_struct(adaptive->owner);

return sleep;
@@ -122,8 +126,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
}

+extern int rtlock_timeout;
+
#define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL, \
+ .timeout = rtlock_timeout, }

#else

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c24c53d..55189ea 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -56,6 +56,8 @@
#include <asm/stacktrace.h>
#endif

+#include "rtmutex_adaptive.h"
+
static int deprecated_sysctl_warning(struct __sysctl_args *args);

#if defined(CONFIG_SYSCTL)
@@ -850,6 +852,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rtlock_timeout",
+ .data = &rtlock_timeout,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_PROC_FS
{
.ctl_name = CTL_UNNUMBERED,

2008-02-25 16:31:39

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 7/9] adaptive mutexes

From: Peter W. Morreale <[email protected]>

This patch adds the adaptive spin lock busywait to rtmutexes. It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale <[email protected]>
---

kernel/Kconfig.preempt | 37 ++++++++++++++++++++++
kernel/rtmutex.c | 76 +++++++++++++++++++++++++--------------------
kernel/rtmutex_adaptive.h | 32 ++++++++++++++++++-
kernel/sysctl.c | 10 ++++++
4 files changed, 119 insertions(+), 36 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index ac1cbad..864bf14 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -214,6 +214,43 @@ config RTLOCK_DELAY
tunable at runtime via a sysctl. A setting of 0 (zero) disables
the adaptive algorithm entirely.

+config ADAPTIVE_RTMUTEX
+ bool "Adaptive real-time mutexes"
+ default y
+ depends on ADAPTIVE_RTLOCK
+ help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes. In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidence.
+
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation. In the event the critical
+ section does not need to sleep, an additional gain in throughput
+ can be seen by avoiding the extra overhead of sleeping.
+
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm. It will spin unless it determines it must
+ sleep to avoid deadlock. This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+
+ If unsure, say Y
+
+config RTMUTEX_DELAY
+ int "Default delay (in loops) for adaptive mutexes"
+ range 0 10000000
+ depends on ADAPTIVE_RTMUTEX
+ default "3000"
+ help
+ This allows you to specify the maximum delay a task will use
+ to wait for a rt mutex before going to sleep. Note that that
+ although the delay is implemented as a preemptable loop, tasks
+ of like priority cannot preempt each other and this setting can
+ result in increased latencies.
+
+ The value is tunable at runtime via a sysctl. A setting of 0
+ (zero) disables the adaptive algorithm entirely.
+
config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a16b13..ea593e0 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -29,6 +29,10 @@ int rtmutex_lateral_steal __read_mostly = 1;
int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
#endif

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
/*
* lock->owner state tracking:
*
@@ -542,34 +546,33 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
* Do the wakeup before the ownership change to give any spinning
* waiter grantees a headstart over the other threads that will
* trigger once owner changes.
+ *
+ * We can skip the actual (expensive) wakeup if the
+ * waiter is already running, but we have to be careful
+ * of race conditions because they may be about to sleep.
+ *
+ * The waiter-side protocol has the following pattern:
+ * 1: Set state != RUNNING
+ * 2: Conditionally sleep if waiter->task != NULL;
+ *
+ * And the owner-side has the following:
+ * A: Set waiter->task = NULL
+ * B: Conditionally wake if the state != RUNNING
+ *
+ * As long as we ensure 1->2 order, and A->B order, we
+ * will never miss a wakeup.
+ *
+ * Therefore, this barrier ensures that waiter->task = NULL
+ * is visible before we test the pendowner->state. The
+ * corresponding barrier is in the sleep logic.
*/
- if (!savestate)
- wake_up_process(pendowner);
- else {
- /*
- * We can skip the actual (expensive) wakeup if the
- * waiter is already running, but we have to be careful
- * of race conditions because they may be about to sleep.
- *
- * The waiter-side protocol has the following pattern:
- * 1: Set state != RUNNING
- * 2: Conditionally sleep if waiter->task != NULL;
- *
- * And the owner-side has the following:
- * A: Set waiter->task = NULL
- * B: Conditionally wake if the state != RUNNING
- *
- * As long as we ensure 1->2 order, and A->B order, we
- * will never miss a wakeup.
- *
- * Therefore, this barrier ensures that waiter->task = NULL
- * is visible before we test the pendowner->state. The
- * corresponding barrier is in the sleep logic.
- */
- smp_mb();
+ smp_mb();

- if ((pendowner->state != TASK_RUNNING)
- && (pendowner->state != TASK_RUNNING_MUTEX))
+ if ((pendowner->state != TASK_RUNNING)
+ && (pendowner->state != TASK_RUNNING_MUTEX)) {
+ if (!savestate)
+ wake_up_process(pendowner);
+ else
wake_up_process_mutex(pendowner);
}

@@ -802,7 +805,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(&waiter);

/* adaptive_wait() returns 1 if we need to sleep */
- if (adaptive_wait(lock, &waiter, &adaptive)) {
+ if (adaptive_wait(lock, 0, &waiter, &adaptive)) {
update_current(TASK_UNINTERRUPTIBLE, &saved_state);
/*
* The xchg() in update_current() is an implicit
@@ -1018,6 +1021,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
int ret = 0, saved_lock_depth = -1;
struct rt_mutex_waiter waiter;
unsigned long flags;
+ DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive);

debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -1038,8 +1042,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(current->lock_depth >= 0))
saved_lock_depth = rt_release_bkl(lock, flags);

- set_current_state(state);
-
/* Setup the timer, when timeout != NULL */
if (unlikely(timeout))
hrtimer_start(&timeout->timer, timeout->timer.expires,
@@ -1092,6 +1094,9 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(ret))
break;
}
+
+ mutex_prepare_adaptive_wait(lock, &adaptive);
+
saved_flags = current->flags & PF_NOSCHED;
current->flags &= ~PF_NOSCHED;

@@ -1099,17 +1104,20 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,

debug_rt_mutex_print_deadlock(&waiter);

- if (waiter.task)
- schedule_rt_mutex(lock);
+ if (mutex_adaptive_wait(lock,
+ (state == TASK_INTERRUPTIBLE),
+ &waiter, &adaptive)) {
+ set_current_state(state);
+ if (waiter.task)
+ schedule_rt_mutex(lock);
+ set_current_state(TASK_RUNNING);
+ }

spin_lock_irq(&lock->wait_lock);

current->flags |= saved_flags;
- set_current_state(state);
}

- set_current_state(TASK_RUNNING);
-
if (unlikely(waiter.task))
remove_waiter(lock, &waiter, flags);

diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 60c6328..ad5beea 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -60,7 +60,8 @@ struct adaptive_waiter {
*
*/
static inline int
-adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
+adaptive_wait(struct rt_mutex *lock, int interruptible,
+ struct rt_mutex_waiter *waiter,
struct adaptive_waiter *adaptive)
{
int sleep = 0;
@@ -81,6 +82,14 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
if (adaptive->owner != rt_mutex_owner(lock))
break;

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+ /*
+ * Mutexes may need to check for signals...
+ */
+ if (interruptible && signal_pending(current))
+ break;
+#endif
+
/*
* If we got here, presumably the lock ownership is still
* current. We will use it to our advantage to be able to
@@ -136,10 +145,29 @@ extern int rtlock_timeout;

#define DECLARE_ADAPTIVE_WAITER(name)

-#define adaptive_wait(lock, waiter, busy) 1
+#define adaptive_wait(lock, intr, waiter, busy) 1
#define prepare_adaptive_wait(lock, busy) {}

#endif /* CONFIG_ADAPTIVE_RTLOCK */

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+
+#define mutex_adaptive_wait adaptive_wait
+#define mutex_prepare_adaptive_wait prepare_adaptive_wait
+
+extern int rtmutex_timeout;
+
+#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
+ struct adaptive_waiter name = { .owner = NULL, \
+ .timeout = rtmutex_timeout, }
+
+#else
+
+#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
+
+#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
+#define mutex_prepare_adaptive_wait(lock, busy) {}
+
+#endif /* CONFIG_ADAPTIVE_RTMUTEX */

#endif /* __KERNEL_RTMUTEX_ADAPTIVE_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 55189ea..c353e75 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -862,6 +862,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rtmutex_timeout",
+ .data = &rtmutex_timeout,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_PROC_FS
{
.ctl_name = CTL_UNNUMBERED,

2008-02-25 16:31:54

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup

From: Peter W. Morreale <[email protected]>

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner. We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch yields a measurable increase in throughput.

Signed-off-by: Peter W. Morreale <[email protected]>
---

kernel/rtmutex.c | 14 +++++++++-----
1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ea593e0..b81bbef 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -526,6 +526,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+ struct rt_mutex_waiter *next;

spin_lock(&current->pi_lock);

@@ -587,6 +588,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
* waiter with higher priority than pending-owner->normal_prio
* is blocked on the unboosted (pending) owner.
*/
+
+ if (rt_mutex_has_waiters(lock))
+ next = rt_mutex_top_waiter(lock);
+ else
+ next = NULL;
+
spin_lock(&pendowner->pi_lock);

WARN_ON(!pendowner->pi_blocked_on);
@@ -595,12 +602,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)

pendowner->pi_blocked_on = NULL;

- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
+ if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- }
+
spin_unlock(&pendowner->pi_lock);
}

2008-02-25 16:32:14

by Gregory Haskins

Subject: [(RT RFC) PATCH v2 9/9] remove the extra call to try_to_take_lock

From: Peter W. Morreale <[email protected]>

Remove the redundant attempt to get the lock. While it is true that the
exit path with this patch adds an unnecessary xchg (in the event the
lock is granted without further traversal in the loop), experimentation
shows that we almost never encounter this situation.

Signed-off-by: Peter W. Morreale <[email protected]>
---

kernel/rtmutex.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b81bbef..266ae31 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -756,12 +756,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

- /* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- return;
- }
-
BUG_ON(rt_mutex_owner(lock) == current);

/*

2008-02-25 21:53:23

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

Hi!

> From: Sven-Thorsten Dietrich <[email protected]>
>
> Add /proc/sys/kernel/lateral_steal, to allow switching on and off
> equal-priority mutex stealing between threads.
>
> Signed-off-by: Sven-Thorsten Dietrich <[email protected]>
> ---
>
> kernel/rtmutex.c | 7 ++++++-
> kernel/sysctl.c | 14 ++++++++++++++
> 2 files changed, 20 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 6624c66..cd39c26 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -18,6 +18,10 @@
>
> #include "rtmutex_common.h"
>
> +#ifdef CONFIG_RTLOCK_LATERAL_STEAL
> +int rtmutex_lateral_steal __read_mostly = 1;
> +#endif
> +

Ugly..


> /*
> * lock->owner state tracking:
> *
> @@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
> if (current->prio > pendowner->prio)
> return 0;
>
> - if (!unfair && (current->prio == pendowner->prio))
> + if (unlikely(current->prio == pendowner->prio) &&
> + !(unfair && rtmutex_lateral_steal))
> #endif

But this is even worse; you are creating an #ifdef maze here. Can you
simply #define rtmutex_lateral_steal 0 in !CONFIG_RTLOCK_LATERAL_STEAL
and let the optimizer fix this?
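
Something like this (just a sketch, not tested):

#ifdef CONFIG_RTLOCK_LATERAL_STEAL
extern int rtmutex_lateral_steal;       /* runtime sysctl knob */
#else
#define rtmutex_lateral_steal 0         /* lets the compiler drop the dead branch */
#endif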


> index c913d48..c24c53d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
> int sysctl_legacy_va_layout;
> #endif
>
> +#ifdef CONFIG_RTLOCK_LATERAL_STEAL
> +extern int rtmutex_lateral_steal;
> +#endif
> +

Try checkpatch.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 21:54:37

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

Hi!

> @@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> * saved_state accordingly. If we did not get a real wakeup
> * then we return with the saved state.
> */
> - saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
> + saved_state = current->state;
> + smp_mb();
>
> for (;;) {
> unsigned long saved_flags;

Please document what the barrier is good for.

Plus, you are replacing an atomic operation with a nonatomic one; is that ok?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 22:03:02

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

Hi!

> +/*
> + * Adaptive-rtlocks will busywait when possible, and sleep only if
> + * necessary. Note that the busyloop looks racy, and it is....but we do
> + * not care. If we lose any races it simply means that we spin one more
> + * time before seeing that we need to break-out on the next iteration.
> + *
> + * We realize this is a relatively large function to inline, but note that
> + * it is only instantiated 1 or 2 times max, and it makes a measurable
> + * performance different to avoid the call.
> + *
> + * Returns 1 if we should sleep
> + *
> + */
> +static inline int
> +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
> + struct adaptive_waiter *adaptive)
> +{
> + int sleep = 0;
> +
> + for (;;) {
> + /*
> + * If the task was re-awoken, break out completely so we can
> + * reloop through the lock-acquisition code.
> + */
> + if (!waiter->task)
> + break;
> +
> + /*
> + * We need to break if the owner changed so we can reloop
> + * and safely acquire the owner-pointer again with the
> + * wait_lock held.
> + */
> + if (adaptive->owner != rt_mutex_owner(lock))
> + break;
> +
> + /*
> + * If we got here, presumably the lock ownership is still
> + * current. We will use it to our advantage to be able to
> + * spin without disabling preemption...
> + */
> +
> + /*
> + * .. sleep if the owner is not running..
> + */
> + if (!adaptive->owner->se.on_rq) {
> + sleep = 1;
> + break;
> + }
> +
> + /*
> + * .. or is running on our own cpu (to prevent deadlock)
> + */
> + if (task_cpu(adaptive->owner) == task_cpu(current)) {
> + sleep = 1;
> + break;
> + }
> +
> + cpu_relax();
> + }
> +
> + put_task_struct(adaptive->owner);
> +
> + return sleep;
> +}
> +

You want to inline this?

> +static inline void
> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
...
> +#define prepare_adaptive_wait(lock, busy) {}

This is evil. Use an empty inline function instead (same for the other
function; there you can maybe get away with it).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 22:05:45

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

On Mon 2008-02-25 11:01:08, Gregory Haskins wrote:
> From: Sven Dietrich <[email protected]>

Why is this good idea?

> Signed-off-by: Sven Dietrich <[email protected]>
> ---

>
> +config RTLOCK_DELAY
> + int "Default delay (in loops) for adaptive rtlocks"
> + range 0 1000000000
> + depends on ADAPTIVE_RTLOCK
> + default "10000"
> + help
> + This allows you to specify the maximum attempts a task will spin
> + attempting to acquire an rtlock before sleeping. The value is
> + tunable at runtime via a sysctl. A setting of 0 (zero) disables
> + the adaptive algorithm entirely.
> +

I believe you have _way_ too many config variables. If this can be set
at runtime, does it need a config option, too?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 22:09:37

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes

Hi!

> From: Peter W.Morreale <[email protected]>
>
> This patch adds the adaptive spin lock busywait to rtmutexes. It adds
> a new tunable: rtmutex_timeout, which is the companion to the
> rtlock_timeout tunable.
>
> Signed-off-by: Peter W. Morreale <[email protected]>

Not signed off by you?

> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index ac1cbad..864bf14 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -214,6 +214,43 @@ config RTLOCK_DELAY
> tunable at runtime via a sysctl. A setting of 0 (zero) disables
> the adaptive algorithm entirely.
>
> +config ADAPTIVE_RTMUTEX
> + bool "Adaptive real-time mutexes"
> + default y
> + depends on ADAPTIVE_RTLOCK
> + help
> + This option adds the adaptive rtlock spin/sleep algorithm to
> + rtmutexes. In rtlocks, a significant gain in throughput
> + can be seen by allowing rtlocks to spin for a distinct
> + amount of time prior to going to sleep for deadlock avoidence.
> +
> + Typically, mutexes are used when a critical section may need to
> + sleep due to a blocking operation. In the event the critical
> + section does not need to sleep, an additional gain in throughput
> + can be seen by avoiding the extra overhead of sleeping.

Watch the whitespace. ... and do we need yet another config option?

> +config RTMUTEX_DELAY
> + int "Default delay (in loops) for adaptive mutexes"
> + range 0 10000000
> + depends on ADAPTIVE_RTMUTEX
> + default "3000"
> + help
> + This allows you to specify the maximum delay a task will use
> + to wait for a rt mutex before going to sleep. Note that that
> + although the delay is implemented as a preemptable loop, tasks
> + of like priority cannot preempt each other and this setting can
> + result in increased latencies.
> +
> + The value is tunable at runtime via a sysctl. A setting of 0
> + (zero) disables the adaptive algorithm entirely.

Ouch.

> +#ifdef CONFIG_ADAPTIVE_RTMUTEX
> +
> +#define mutex_adaptive_wait adaptive_wait
> +#define mutex_prepare_adaptive_wait prepare_adaptive_wait
> +
> +extern int rtmutex_timeout;
> +
> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
> + struct adaptive_waiter name = { .owner = NULL, \
> + .timeout = rtmutex_timeout, }
> +
> +#else
> +
> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
> +
> +#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
> +#define mutex_prepare_adaptive_wait(lock, busy) {}

More evil macros. Macros do not behave like functions; make it an
inline function if you are replacing a function.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 22:10:31

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup

On Mon 2008-02-25 11:01:18, Gregory Haskins wrote:
> From: Peter W.Morreale <[email protected]>
>
> In wakeup_next_waiter(), we take the pi_lock, and then find out whether
> we have another waiter to add to the pending owner. We can reduce
> contention on the pi_lock for the pending owner if we first obtain the
> pointer to the next waiter outside of the pi_lock.
>
> This patch adds a measureable increase in throughput.
>
> Signed-off-by: Peter W. Morreale <[email protected]>

Missing sign-off.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 22:20:09

by Greg KH

Subject: Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

On Mon, Feb 25, 2008 at 11:06:01PM +0100, Pavel Machek wrote:
> On Mon 2008-02-25 11:01:08, Gregory Haskins wrote:
> > From: Sven Dietrich <[email protected]>
>
> Why is this good idea?

It preserves the original author when the patch is applied by git.
Otherwise the author of the email would be classified as the author of
the patch, which is not the correct thing to have happen a lot of times
(like this one).

thanks,

greg k-h

2008-02-25 22:21:13

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

On Mon 2008-02-25 14:19:47, Greg KH wrote:
> On Mon, Feb 25, 2008 at 11:06:01PM +0100, Pavel Machek wrote:
> > On Mon 2008-02-25 11:01:08, Gregory Haskins wrote:
> > > From: Sven Dietrich <[email protected]>
> >
> > Why is this good idea?
>
> It preserves the original author when the patch is applied by git.
> Otherwise the author of the email would be classified as the author of
> the patch, which is not the correct thing to have happen a lot of times
> (like this one.)

Sorry, that was not what I meant. The patch had no changelog at the
top, and I was complaining about that fact.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 22:43:37

by Sven-Thorsten Dietrich

Subject: Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism


On Mon, 2008-02-25 at 23:06 +0100, Pavel Machek wrote:
> On Mon 2008-02-25 11:01:08, Gregory Haskins wrote:
> > From: Sven Dietrich <[email protected]>
>
> Why is this good idea?
>

The timeout is useful to eliminate excessive CPU utilization when
waiting for long-held critical sections.

The patch header should state this :) Will fix.

> > Signed-off-by: Sven Dietrich <[email protected]>
> > ---
>
> >
> > +config RTLOCK_DELAY
> > + int "Default delay (in loops) for adaptive rtlocks"
> > + range 0 1000000000
> > + depends on ADAPTIVE_RTLOCK
> > + default "10000"
> > + help
> > + This allows you to specify the maximum attempts a task will spin
> > + attempting to acquire an rtlock before sleeping. The value is
> > + tunable at runtime via a sysctl. A setting of 0 (zero) disables
> > + the adaptive algorithm entirely.
> > +
>
> I believe you have _way_ too many config variables. If this can be set
> at runtime, does it need a config option, too?
>

Absolutely. The sysctl was added after the fact, so this is a relic; we
will remove the config option and define the default timeout in the
appropriate header.

Sven

> Pavel

2008-02-25 22:57:38

by Sven-Thorsten Dietrich

Subject: Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing


On Mon, 2008-02-25 at 22:53 +0100, Pavel Machek wrote:
> Hi!
>
> > From: Sven-Thorsten Dietrich <[email protected]>
> >
> > Add /proc/sys/kernel/lateral_steal, to allow switching on and off
> > equal-priority mutex stealing between threads.
> >
> > Signed-off-by: Sven-Thorsten Dietrich <[email protected]>
> > ---
> >
> >
> > #include "rtmutex_common.h"
> >

I'll move it to the header file.

>
> > +#ifdef CONFIG_RTLOCK_LATERAL_STEAL
> > +int rtmutex_lateral_steal __read_mostly = 1;
> > +#endif
> > +
>
> Ugly..
>

>
> > /*
> > * lock->owner state tracking:
> > *
> > @@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
> > if (current->prio > pendowner->prio)
> > return 0;
> >
> > - if (!unfair && (current->prio == pendowner->prio))
> > + if (unlikely(current->prio == pendowner->prio) &&
> > + !(unfair && rtmutex_lateral_steal))
> > #endif
>
> But this is even worse, you are creating #ifdef maze here. Can you
> simply #define rtmutex_lateral_steal 0 in !CONFIG_RTLOCK_LATERAL_STEAL
> and let the optimizer fix this?
>

Ok - much of this will also disappear into the header then.

>
> > index c913d48..c24c53d 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
> > int sysctl_legacy_va_layout;
> > #endif
> >
> > +#ifdef CONFIG_RTLOCK_LATERAL_STEAL
> > +extern int rtmutex_lateral_steal;
> > +#endif
> > +
>
> Try checkpatch.
> Pavel

I have that as part of quilt refresh, and it returns:

Your patch has no obvious style problems and is ready for submission.

But Greg may need to enforce it on the git tree he mails these from -
are you referring to anything specific in this patch?

Sven

2008-02-25 23:00:00

by Pavel Machek

Subject: Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

Hi!

> > > index c913d48..c24c53d 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
> > > int sysctl_legacy_va_layout;
> > > #endif
> > >
> > > +#ifdef CONFIG_RTLOCK_LATERAL_STEAL
> > > +extern int rtmutex_lateral_steal;
> > > +#endif
> > > +
> >
> > Try checkpatch.
>
> I have that as part of quilt refresh, and it returns:
>
> Your patch has no obvious style problems and is ready for
> submission.

Sorry. Anyway, there should not be externs in .c files, so

extern int rtmutex_lateral_steal;

should go to a header file.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-25 23:40:50

by Sven-Thorsten Dietrich

Subject: Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing


On Tue, 2008-02-26 at 00:00 +0100, Pavel Machek wrote:
> Hi!
>
> > > > index c913d48..c24c53d 100644
> > > > --- a/kernel/sysctl.c
> > > > +++ b/kernel/sysctl.c
> > > > @@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
> > > > int sysctl_legacy_va_layout;
> > > > #endif
> > > >
> > > > +#ifdef CONFIG_RTLOCK_LATERAL_STEAL
> > > > +extern int rtmutex_lateral_steal;
> > > > +#endif
> > > > +
> > >
> > > Try checkpatch.
> >
> > I have that as part of quilt refresh, and it returns:
> >
> > Your patch has no obvious style problems and is ready for
> > submission.
>
> Sorry. Anyway, there should not be externs in .c files, so
>
> extern int rtmutex_lateral_steal;
>
> should go to header file.

Just for the record:

grep extern kernel/sysctl.c | wc -l
40

There is a whole slew here that I agree could move elsewhere - maybe
into sysctl.h?

If no one objects, I'll make a patch to move them all?

Sven

> Pavel

2008-02-26 00:52:47

by Gregory Haskins

Subject: Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

>>> On Mon, Feb 25, 2008 at 4:54 PM, in message
<[email protected]>, Pavel Machek <[email protected]> wrote:
> Hi!
>
>> @@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> * saved_state accordingly. If we did not get a real wakeup
>> * then we return with the saved state.
>> */
>> - saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
>> + saved_state = current->state;
>> + smp_mb();
>>
>> for (;;) {
>> unsigned long saved_flags;
>
> Please document what the barrier is good for.

Yeah, I think you are right that this isn't needed. I think that is a relic from back when I was debugging some other problems. Let me wrap my head around the implications of removing it, and either remove it or document appropriately.

>
> Plus, you are replacing atomic operation with nonatomic; is that ok?

Yeah, I think so. We are substituting a write with a read, and word reads are always atomic anyway IIUC (or is that only true on certain architectures)? Note that we are moving the atomic-write to be done later in the update_current() calls.
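
As a rough user-space analogy of the trade-off (C11 atomics standing in for
the kernel primitives -- this is not the patch code, just an illustration of
exchange-vs-load):

	#include <stdatomic.h>
	#include <stdio.h>

	int main(void)
	{
		atomic_long a = ATOMIC_VAR_INIT(1);	/* stand-in for current->state */
		atomic_long b = ATOMIC_VAR_INIT(1);

		/* old scheme: xchg() returns the old value and stores the new
		 * one in a single atomic read-modify-write */
		long saved_a = atomic_exchange(&a, 2);

		/* proposed scheme: a plain word-sized read; the store of the
		 * new state is deferred (in the patch, into the update_current()
		 * calls mentioned above) */
		long saved_b = atomic_load(&b);

		printf("saved_a=%ld a=%ld   saved_b=%ld b=%ld\n",
		       saved_a, (long)atomic_load(&a), saved_b, (long)atomic_load(&b));
		return 0;
	}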

-Greg


2008-02-26 00:55:44

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

>>> On Mon, Feb 25, 2008 at 5:03 PM, in message
<[email protected]>, Pavel Machek <[email protected]> wrote:
> Hi!
>
>> +/*
>> + * Adaptive-rtlocks will busywait when possible, and sleep only if
>> + * necessary. Note that the busyloop looks racy, and it is....but we do
>> + * not care. If we lose any races it simply means that we spin one more
>> + * time before seeing that we need to break-out on the next iteration.
>> + *
>> + * We realize this is a relatively large function to inline, but note that
>> + * it is only instantiated 1 or 2 times max, and it makes a measurable
>> + * performance difference to avoid the call.
>> + *
>> + * Returns 1 if we should sleep
>> + *
>> + */
>> +static inline int
>> +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
>> + struct adaptive_waiter *adaptive)
>> +{
>> + int sleep = 0;
>> +
>> + for (;;) {
>> + /*
>> + * If the task was re-awoken, break out completely so we can
>> + * reloop through the lock-acquisition code.
>> + */
>> + if (!waiter->task)
>> + break;
>> +
>> + /*
>> + * We need to break if the owner changed so we can reloop
>> + * and safely acquire the owner-pointer again with the
>> + * wait_lock held.
>> + */
>> + if (adaptive->owner != rt_mutex_owner(lock))
>> + break;
>> +
>> + /*
>> + * If we got here, presumably the lock ownership is still
>> + * current. We will use it to our advantage to be able to
>> + * spin without disabling preemption...
>> + */
>> +
>> + /*
>> + * .. sleep if the owner is not running..
>> + */
>> + if (!adaptive->owner->se.on_rq) {
>> + sleep = 1;
>> + break;
>> + }
>> +
>> + /*
>> + * .. or is running on our own cpu (to prevent deadlock)
>> + */
>> + if (task_cpu(adaptive->owner) == task_cpu(current)) {
>> + sleep = 1;
>> + break;
>> + }
>> +
>> + cpu_relax();
>> + }
>> +
>> + put_task_struct(adaptive->owner);
>> +
>> + return sleep;
>> +}
>> +
>
> You want to inline this?

Yes. As the comment indicates, there are 1-2 users tops, and it has a significant impact on throughput (> 5%) to take the hit with a call. I don't think it's actually much code anyway... it's all comments.

>
>> +static inline void
>> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
> *adaptive)
> ...
>> +#define prepare_adaptive_wait(lock, busy) {}
>
> This is evil. Use empty inline function instead (same for the other
> function, there you can maybe get away with it).

Ok.


> Pavel


2008-02-26 00:58:59

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes

>>> On Mon, Feb 25, 2008 at 5:09 PM, in message
<[email protected]>, Pavel Machek <[email protected]> wrote:
> Hi!
>
>> From: Peter W.Morreale <[email protected]>
>>
>> This patch adds the adaptive spin lock busywait to rtmutexes. It adds
>> a new tunable: rtmutex_timeout, which is the companion to the
>> rtlock_timeout tunable.
>>
>> Signed-off-by: Peter W. Morreale <[email protected]>
>
> Not signed off by you?

I wasn't sure if this was appropriate for me to do. This is the first time I was acting as "upstream" to someone. If that is what I am expected to do, consider this an "ack" for your remaining comments related to this.

>
>> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
>> index ac1cbad..864bf14 100644
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -214,6 +214,43 @@ config RTLOCK_DELAY
>> tunable at runtime via a sysctl. A setting of 0 (zero) disables
>> the adaptive algorithm entirely.
>>
>> +config ADAPTIVE_RTMUTEX
>> + bool "Adaptive real-time mutexes"
>> + default y
>> + depends on ADAPTIVE_RTLOCK
>> + help
>> + This option adds the adaptive rtlock spin/sleep algorithm to
>> + rtmutexes. In rtlocks, a significant gain in throughput
>> + can be seen by allowing rtlocks to spin for a distinct
>> + amount of time prior to going to sleep for deadlock avoidance.
>> +
>> + Typically, mutexes are used when a critical section may need to
>> + sleep due to a blocking operation. In the event the critical
>> + section does not need to sleep, an additional gain in throughput
>> + can be seen by avoiding the extra overhead of sleeping.
>
> Watch the whitespace. ... and do we need yet another config options?
>
>> +config RTMUTEX_DELAY
>> + int "Default delay (in loops) for adaptive mutexes"
>> + range 0 10000000
>> + depends on ADAPTIVE_RTMUTEX
>> + default "3000"
>> + help
>> + This allows you to specify the maximum delay a task will use
>> + to wait for an rt mutex before going to sleep. Note that
>> + although the delay is implemented as a preemptable loop, tasks
>> + of like priority cannot preempt each other and this setting can
>> + result in increased latencies.
>> +
>> + The value is tunable at runtime via a sysctl. A setting of 0
>> + (zero) disables the adaptive algorithm entirely.
>
> Ouch.

? Is this a reference to the whitespace damage, or does the content need addressing?

>
>> +#ifdef CONFIG_ADAPTIVE_RTMUTEX
>> +
>> +#define mutex_adaptive_wait adaptive_wait
>> +#define mutex_prepare_adaptive_wait prepare_adaptive_wait
>> +
>> +extern int rtmutex_timeout;
>> +
>> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
>> + struct adaptive_waiter name = { .owner = NULL, \
>> + .timeout = rtmutex_timeout, }
>> +
>> +#else
>> +
>> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
>> +
>> +#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
>> +#define mutex_prepare_adaptive_wait(lock, busy) {}
>
> More evil macros. Macro does not behave like a function, make it
> inline function if you are replacing a function.

Ok


> Pavel


2008-02-26 01:22:38

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

>>> On Mon, Feb 25, 2008 at 5:57 PM, in message
<[email protected]>, Sven-Thorsten Dietrich
<[email protected]> wrote:
>
> But Greg may need to enforce it on his git tree that he mails these from
> - are you referring to anything specific in this patch?
>

That's what I don't get. I *did* checkpatch all of these before sending them out (and I have for every release).

I am aware of two "tabs vs spaces" warnings, but the rest checked clean. Why do some people still see errors when I don't? Is there a set of switches I should supply to checkpatch to make it more aggressive or something?

-Greg

2008-02-26 15:11:19

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

>>> On Mon, Feb 25, 2008 at 5:03 PM, in message
<[email protected]>, Pavel Machek <[email protected]> wrote:

>> +static inline void
>> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
> *adaptive)
> ...
>> +#define prepare_adaptive_wait(lock, busy) {}
>
> This is evil. Use empty inline function instead (same for the other
> function, there you can maybe get away with it).
>

I went to implement your suggested change and I remembered why I did it this way: I wanted a macro so that the "struct adaptive_waiter" local variable will fall away without an #ifdef in the main body of code. So I have left this logic alone for now.
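
For the record, the call-site pattern in question looks roughly like this
(simplified; the exact macro names in the series may differ slightly):

	DECLARE_ADAPTIVE_WAITER(adaptive);

	/* later, in the contention path: */
	prepare_adaptive_wait(lock, &adaptive);

When the option is compiled out, the declaration macro expands to nothing and
the prepare call expands to an empty statement, so the "&adaptive" argument
text is discarded by the preprocessor and the local never has to exist. An
empty inline function, by contrast, still has to compile the "&adaptive"
expression, which drags the variable (or an #ifdef) back into the main body
of code.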

2008-02-26 15:16:25

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

>>> On Mon, Feb 25, 2008 at 5:06 PM, in message
<[email protected]>, Pavel Machek <[email protected]> wrote:
>
> I believe you have _way_ too many config variables. If this can be set
> at runtime, does it need a config option, too?

Generally speaking, I think that until this algorithm has an adaptive timeout in addition to the adaptive spin/sleep, these .config-based defaults are a good idea. Setting these things at runtime is sometimes a PITA when you are talking about embedded systems that might not have/want a nice userspace sysctl-config infrastructure, and changing the defaults in the code is unattractive for some users. I don't think it's a big deal either way, so if people hate the config options, they should go. But I thought I would throw this use-case out there to ponder.

Regards,
-Greg

2008-02-26 18:06:26

by Pavel Machek

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

On Tue 2008-02-26 08:03:43, Gregory Haskins wrote:
> >>> On Mon, Feb 25, 2008 at 5:03 PM, in message
> <[email protected]>, Pavel Machek <[email protected]> wrote:
>
> >> +static inline void
> >> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
> > *adaptive)
> > ...
> >> +#define prepare_adaptive_wait(lock, busy) {}
> >
> > This is evil. Use empty inline function instead (same for the other
> > function, there you can maybe get away with it).
> >
>
> I went to implement your suggested change and I remembered why I did it this way: I wanted a macro so that the "struct adaptive_waiter" local variable will fall away without an #ifdef in the main body of code. So I have left this logic alone for now.

Hmm, but inline function will allow dead code elimination, too, no?

Anyway non-evil way to do it with macro is

#define prepare_adaptive_wait(lock, busy) do {} while (0)

...that behaves properly in complex statements.
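
For instance (a standalone toy, not code from the series), the difference
shows up as soon as the call sits in an unbraced if/else:

	#include <stdio.h>

	#define prepare_bare(lock, busy)	{}			/* the form above */
	#define prepare_safe(lock, busy)	do {} while (0)		/* the fix */

	int main(void)
	{
		int cond = 1;

	#if 0	/* does not compile: the call expands to "{} ;", so the
		 * "else" below loses its matching "if" */
		if (cond)
			prepare_bare(0, 0);
		else
			printf("contended\n");
	#endif

		if (cond)
			prepare_safe(0, 0);	/* swallows the ';', parses like a call */
		else
			printf("contended\n");

		return 0;
	}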
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-26 18:08:59

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

>>> On Tue, Feb 26, 2008 at 1:06 PM, in message
<[email protected]>, Pavel Machek <[email protected]> wrote:
> On Tue 2008-02-26 08:03:43, Gregory Haskins wrote:
>> >>> On Mon, Feb 25, 2008 at 5:03 PM, in message
>> <[email protected]>, Pavel Machek <[email protected]> wrote:
>>
>> >> +static inline void
>> >> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
>> > *adaptive)
>> > ...
>> >> +#define prepare_adaptive_wait(lock, busy) {}
>> >
>> > This is evil. Use empty inline function instead (same for the other
>> > function, there you can maybe get away with it).
>> >
>>
>> I went to implement your suggested change and I remembered why I did it this
> way: I wanted a macro so that the "struct adaptive_waiter" local variable
> will fall away without an #ifdef in the main body of code. So I have left
> this logic alone for now.
>
> Hmm, but inline function will allow dead code elimination, too, no?

I was getting compile errors. Might be operator-error ;)

>
> Anyway non-evil way to do it with macro is
>
> #define prepare_adaptive_wait(lock, busy) do {} while (0)
>
> ...that behaves properly in complex statements.

Ah, I was wondering why people use that. Will do. Thanks!

-Greg

2008-03-03 15:13:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to include lateral priority



/me finally has time to catch up on some email.

On Mon, 25 Feb 2008, Gregory Haskins wrote:

> The current logic only allows lock stealing to occur if the current task
> is of higher priority than the pending owner. We can gain significant
> throughput improvements (200%+) by allowing the lock-stealing code to
> include tasks of equal priority. The theory is that the system will make
> faster progress by allowing the task already on the CPU to take the lock
> rather than waiting for the system to wake up a different task.
>
> This does add a degree of unfairness, yes. But also note that the users
> of these locks under non-rt environments have already been using unfair
> raw spinlocks anyway so the tradeoff is probably worth it.
>
> The way I like to think of this is that higher priority tasks should
> clearly preempt, and lower priority tasks should clearly block. However,
> if tasks have an identical priority value, then we can think of the
> scheduler decisions as the tie-breaking parameter. (e.g. tasks that the
> scheduler picked to run first have a logically higher priority among tasks
> of the same prio). This helps to keep the system "primed" with tasks doing
> useful work, and the end result is higher throughput.

Interesting. I thought about this when I first implemented it. My thoughts
were on fairness, and I had some worry about starvation. But if you have
two processes of the same RT priority, then you must account for it.

But... this can cause problems when two tasks of the same RT
priority are bound to two different CPUs.

    CPU0                                    CPU1
    ----                                    ----
RT task blocks on L1
                                       Owner releases L1
RT task becomes pending owner of L1
                                       Same-prio RT task grabs L1
                                       (steals from the other RT task)
RT task wakes up to find
L1 stolen and goes
back to sleep.
                                       Releases L1, and the CPU0
                                       RT task becomes pending owner
                                       Grabs L1 again and steals it again.
RT task wakes up to find
L1 stolen and goes back
to sleep.


See the issue. The RT task on CPU0 may experience huge latencies.
Remember, RT is worried about latencies over performance.
If we can not ***guarantee*** a bounded latency, then, I don't care
how good the performance is, it is not good enough for RT.


That said, here's the compromise.

Non-RT tasks care more about overall performance than worst-case latencies.
So... see embedded:


>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> kernel/Kconfig.preempt | 10 ++++++++++
> kernel/rtmutex.c | 31 +++++++++++++++++++++++--------
> 2 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index 41a0d88..e493257 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -196,3 +196,13 @@ config SPINLOCK_BKL
> Say Y here if you are building a kernel for a desktop system.
> Say N if you are unsure.
>
> +config RTLOCK_LATERAL_STEAL
> + bool "Allow equal-priority rtlock stealing"
> + default y
> + depends on PREEMPT_RT
> + help
> + This option alters the rtlock lock-stealing logic to allow
> + equal priority tasks to preempt a pending owner in addition
> + to higher priority tasks. This allows for a significant
> + boost in throughput under certain circumstances at the expense
> + of strict FIFO lock access.

We either do this or we don't. No config option.

> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index a2b00cc..6624c66 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -313,12 +313,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
> return ret;
> }
>
> +static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
> +{
> +#ifndef CONFIG_RTLOCK_LATERAL_STEAL
> + if (current->prio >= pendowner->prio)
> +#else
> + if (current->prio > pendowner->prio)
> + return 0;
> +
> + if (!unfair && (current->prio == pendowner->prio))
> +#endif
> + return 0;
> +
> + return 1;
> +}
> +

This instead:


if (rt_task(current) ?
(current->prio >= pendingowner->prio) :
(current->prio > pendingowner->prio))


For RT tasks we keep the FIFO order. This keeps it deterministic.
But for non-RT tasks, which can still steal locks, we simply let them
steal if at a higher priority.

And just use that as the condition. No need to add another inline
function (doesn't make it any more readable).

Actually, this is the only change I see in this patch that is needed.
The rest is simply passing parameters and adding extra unneeded options.
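
To spell out what the suggested test does (user-space toy, not the patch
itself; in the kernel a lower ->prio value means higher priority, and
rt_task() below is just a stand-in for the real helper) -- note that the
non-RT leg still allows equal-priority, i.e. lateral, stealing:

	#include <stdio.h>
	#include <stdbool.h>

	struct task { int prio; bool rt; };

	static bool rt_task(const struct task *t) { return t->rt; }

	/* Mirrors the condition above: returns true when the task may NOT steal. */
	static bool steal_denied(const struct task *cur, const struct task *pendowner)
	{
		return rt_task(cur) ?
			(cur->prio >= pendowner->prio) :	/* RT: strict FIFO, no lateral steal */
			(cur->prio >  pendowner->prio);		/* non-RT: equal prio may still steal */
	}

	int main(void)
	{
		struct task rt_cur   = { .prio =  50, .rt = true  };
		struct task rt_own   = { .prio =  50, .rt = true  };
		struct task fair_cur = { .prio = 120, .rt = false };
		struct task fair_own = { .prio = 120, .rt = false };

		printf("RT task vs equal-prio RT pending owner:         steal %s\n",
		       steal_denied(&rt_cur, &rt_own) ? "denied" : "allowed");
		printf("non-RT task vs equal-prio non-RT pending owner: steal %s\n",
		       steal_denied(&fair_cur, &fair_own) ? "denied" : "allowed");
		return 0;
	}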

-- Steve




> /*
> * Optimization: check if we can steal the lock from the
> * assigned pending owner [which might not have taken the
> * lock yet]:
> */
> -static inline int try_to_steal_lock(struct rt_mutex *lock)
> +static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
> {
> struct task_struct *pendowner = rt_mutex_owner(lock);
> struct rt_mutex_waiter *next;
> @@ -330,7 +345,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
> return 1;
>
> spin_lock(&pendowner->pi_lock);
> - if (current->prio >= pendowner->prio) {
> + if (!lock_is_stealable(pendowner, unfair)) {
> spin_unlock(&pendowner->pi_lock);
> return 0;
> }
> @@ -383,7 +398,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
> *
> * Must be called with lock->wait_lock held.
> */
> -static int try_to_take_rt_mutex(struct rt_mutex *lock)
> +static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
> {
> /*
> * We have to be careful here if the atomic speedups are
> @@ -406,7 +421,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
> */
> mark_rt_mutex_waiters(lock);
>
> - if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
> + if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
> return 0;
>
> /* We got the lock. */
> @@ -707,7 +722,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> int saved_lock_depth = current->lock_depth;
>
> /* Try to acquire the lock */
> - if (try_to_take_rt_mutex(lock))
> + if (try_to_take_rt_mutex(lock, 1))
> break;
> /*
> * waiter.task is NULL the first time we come here and
> @@ -947,7 +962,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
> init_lists(lock);
>
> /* Try to acquire the lock again: */
> - if (try_to_take_rt_mutex(lock)) {
> + if (try_to_take_rt_mutex(lock, 0)) {
> spin_unlock_irqrestore(&lock->wait_lock, flags);
> return 0;
> }
> @@ -970,7 +985,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
> unsigned long saved_flags;
>
> /* Try to acquire the lock: */
> - if (try_to_take_rt_mutex(lock))
> + if (try_to_take_rt_mutex(lock, 0))
> break;
>
> /*
> @@ -1078,7 +1093,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lock)
>
> init_lists(lock);
>
> - ret = try_to_take_rt_mutex(lock);
> + ret = try_to_take_rt_mutex(lock, 0);
> /*
> * try_to_take_rt_mutex() sets the lock waiters
> * bit unconditionally. Clean this up.
>
>

2008-03-03 15:37:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 4/9] optimize rt lock wakeup


On Mon, 25 Feb 2008, Gregory Haskins wrote:

> It is redundant to wake the grantee task if it is already running

This is only expensive for your adaptive mutex algorithm, not what is in
there today. It belongs _after_ the adaptive mutex patch in the series.

-- Steve

2008-03-03 15:48:47

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to includelateral priority

>>> On Mon, Mar 3, 2008 at 10:13 AM, in message
<[email protected]>, Steven Rostedt
<[email protected]> wrote:


>
> See the issue. The RT task on CPU0 may experience huge latencies.

Agreed, but equal priority threads can always cause unbounded latencies by definition. I.e. we only guarantee to the highest thread.

> Remember, RT is worried about latencies over performance.
> If we can not ***guarantee*** a bounded latency, then, I don't care
> how good the performance is, it is not good enough for RT.
>
>
> That said, here's the compromise.
>
> Non-RT tasks care more about overall performance than worst-case latencies.
> So... see embedded:

This isn't a bad idea, but note that it means RT tasks will not get the performance boost, which is quite substantial.

Note that I have substantially cleaned up this patch for the drop I will make later this week (v3).

-Greg

2008-03-03 15:49:12

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 4/9] optimize rt lock wakeup

>>> On Mon, Mar 3, 2008 at 10:37 AM, in message
<[email protected]>, Steven Rostedt
<[email protected]> wrote:

> On Mon, 25 Feb 2008, Gregory Haskins wrote:
>
>> It is redundant to wake the grantee task if it is already running
>
> This is only expensive for your adaptive mutex algorithm, not what is in
> there today. It belongs _after_ the adaptive mutex patch in the series.
>
> -- Steve

Fair enough.

-Greg

2008-03-03 15:55:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to includelateral priority


On Mon, 3 Mar 2008, Gregory Haskins wrote:

> >>> On Mon, Mar 3, 2008 at 10:13 AM, in message
> >
> > See the issue. The RT task on CPU0 may experience huge latencies.
>
> Agreed, but equal priority threads can always cause unbounded latencies by definition. I.e. we only guarantee to the highest thread.

It should not when they are bound to two separate CPUs and are the
highest-priority tasks on those CPUs. It will be hard to explain how the
highest-prio task bound to a single CPU had an unbounded latency.
With the patches presented, this can happen.

>
> > Remember, RT is worried about latencies over performance.
> > If we can not ***guarantee*** a bounded latency, then, I don't care
> > how good the performance is, it is not good enough for RT.
> >
> >
> > That said, here's the compromise.
> >
> > Non-RT tasks care more about overall performance than worst-case latencies.
> > So... see embedded:
>
> This isn't a bad idea, but note that it means RT tasks will not get a performance boost, which is quite substantial.

Again, no performance is good enough for an RT task if it risks the
slightest chance of causing an unbounded latency.

-- Steve

>
> Note that I have substantially cleaned up this patch for the drop I will make later this week (v3).

2008-03-03 16:02:22

by Gregory Haskins

[permalink] [raw]
Subject: Re: [(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing toincludelateral priority

>>> On Mon, Mar 3, 2008 at 10:55 AM, in message
<[email protected]>, Steven Rostedt
<[email protected]> wrote:

> On Mon, 3 Mar 2008, Gregory Haskins wrote:
>
>> >>> On Mon, Mar 3, 2008 at 10:13 AM, in message
>> >
>> > See the issue. The RT task on CPU0 may experience huge latencies.
>>
>> Agreed, but equal priority threads can always cause unbounded latencies by
> definition. I.e. we only guarantee to the highest thread.
>
> It should not when they are bound to two separate CPUs and are the
> highest-priority tasks on those CPUs. It will be hard to explain how the
> highest-prio task bound to a single CPU had an unbounded latency.
> With the patches presented, this can happen.

Indeed. I see your point. I will make this change in the next revision.

Thanks for the review!
-Greg