2008-02-21 15:54:46

by Gregory Haskins

Subject: [PATCH [RT] 00/14] RFC - adaptive real-time locks

The Real Time patches to the Linux kernel convert the architecture-specific
SMP-synchronization primitives commonly referred to as "spinlocks" to an
"RT mutex" implementation that supports a priority-inheritance protocol
and priority-ordered wait queues. The RT mutex implementation allows tasks
that would otherwise busy-wait for a contended lock to be preempted by
higher-priority tasks without compromising the integrity of critical
sections protected by the lock. The unintended side effect is that the -rt
kernel suffers a significant degradation of I/O throughput (disk and net)
due to the extra overhead associated with managing pi-lists and context
switching. This has generally been accepted as the price to pay for
low-latency preemption.

Our research indicates that it doesn't necessarily have to be this
way. This patch set introduces an adaptive technology that retains both
the priority-inheritance protocol and the preemptive nature of spinlocks
and mutexes, and adds a 300+% throughput increase to the Linux Real Time
kernel. It applies to 2.6.24-rt1.

These performance increases apply to disk I/O as well as netperf UDP
benchmarks, without compromising RT preemption latency. For more
complex applications, overall I/O throughput seems to approach that of a
PREEMPT_VOLUNTARY or PREEMPT_DESKTOP kernel, as shipped by most distros.

Essentially, the RT mutex has been modified to busy-wait under
contention for a limited (and configurable) time. This works because
most locks are typically held for very short time spans. Quite often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context-switch overhead by up to 90%, which saves CPU cycles and
eliminates massive hot-spots in the scheduler and other bottlenecks in the
kernel - even though we burn CPU cycles busy-waiting to poll the lock.
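
To illustrate the policy (and only the policy - the kernel implementation
appears in patches 07/14 and 08/14 below), here is a minimal userspace
sketch in C: a waiter spins on trylock for a bounded number of iterations
and only falls back to the blocking acquire if the lock stays contended.
The MAX_SPINS constant plays the role of the rtlock_timeout tunable
introduced later; the kernel code additionally gives up early if the lock
owner is not running or is on the same CPU.

#include <pthread.h>

#define MAX_SPINS 10000	/* bounded spin; analogous to the rtlock_timeout tunable */

/*
 * Spin briefly on trylock; only block (and thus context switch) if the
 * lock remains contended for the whole window.
 */
static void adaptive_lock(pthread_mutex_t *m)
{
	int i;

	for (i = 0; i < MAX_SPINS; i++) {
		if (pthread_mutex_trylock(m) == 0)
			return;		/* acquired without sleeping */
		/* a real implementation would use a cpu_relax()/pause here */
	}

	pthread_mutex_lock(m);		/* fall back to the blocking slow path */
}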

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel. The machine is a 4-way (dual-core,
dual-socket) 2GHz 5130 Xeon (core2duo-woodcrest) Dell Precision 490.

Some tests show a marked improvement (for instance, dbench and hackbench),
whereas for others (make -j 128) the results were less profound but still
net-positive. In all cases we have also verified, using cyclictest, that
deterministic latency is not impacted.

This patch series also includes some rework of the raw_spinlock
infrastructure, including Nick Piggin's x86 ticket locks. We found that
the increased pressure on the lock->wait_lock could cause rare but
serious latency spikes, which are fixed by a FIFO raw_spinlock_t
implementation. Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25). Note that we also have a
C version of his protocol available if other architectures need
FIFO-lock support as well, which we will gladly post upon request.

Special thanks go to the many people who were instrumental in this project,
including:
*) the -rt team here at Novell for research, development, and testing.
*) Nick Piggin for his invaluable consultation/feedback and the use of his
x86 ticket locks.
*) The reviewers/testers at SUSE and MontaVista, and Bill Huey, for their
time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg


2008-02-21 15:54:25

by Gregory Haskins

Subject: [PATCH [RT] 01/14] spinlocks: fix preemption feature when PREEMPT_RT is enabled

The logic is currently broken such that PREEMPT_RT disables preemptible
spinlock waiters, which is counterintuitive.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/spinlock.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index c9bcf1b..b0e7f02 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
* not re-enabled during lock-acquire (which the preempt-spin-ops do):
*/
#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
- defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT)
+ defined(CONFIG_DEBUG_LOCK_ALLOC)

void __lockfunc __read_lock(raw_rwlock_t *lock)
{

2008-02-21 15:55:27

by Gregory Haskins

Subject: [PATCH [RT] 02/14] spinlock: make preemptible-waiter feature a specific config option

We introduce a configuration variable for the feature to make it easier for
various architectures and/or configs to enable or disable it based on their
requirements.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/Kconfig.preempt | 9 +++++++++
kernel/spinlock.c | 7 +++----
lib/Kconfig.debug | 1 +
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..5b45213 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,6 +86,15 @@ config PREEMPT
default y
depends on PREEMPT_DESKTOP || PREEMPT_RT

+config DISABLE_PREEMPT_SPINLOCK_WAITERS
+ bool
+ default n
+
+config PREEMPT_SPINLOCK_WAITERS
+ bool
+ default y
+ depends on PREEMPT && SMP && !DISABLE_PREEMPT_SPINLOCK_WAITERS
+
config PREEMPT_SOFTIRQS
bool "Thread Softirqs"
default n
diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index b0e7f02..2e6a904 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -116,8 +116,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
* even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
* not re-enabled during lock-acquire (which the preempt-spin-ops do):
*/
-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
- defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if !defined(CONFIG_PREEMPT_SPINLOCK_WAITERS)

void __lockfunc __read_lock(raw_rwlock_t *lock)
{
@@ -244,7 +243,7 @@ void __lockfunc __write_lock(raw_rwlock_t *lock)

EXPORT_SYMBOL(__write_lock);

-#else /* CONFIG_PREEMPT: */
+#else /* CONFIG_PREEMPT_SPINLOCK_WAITERS */

/*
* This could be a long-held lock. We both prepare to spin for a long
@@ -334,7 +333,7 @@ BUILD_LOCK_OPS(spin, raw_spinlock);
BUILD_LOCK_OPS(read, raw_rwlock);
BUILD_LOCK_OPS(write, raw_rwlock);

-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPT_SPINLOCK_WAITERS */

#ifdef CONFIG_DEBUG_LOCK_ALLOC

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9208791..f2889b2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -233,6 +233,7 @@ config DEBUG_LOCK_ALLOC
bool "Lock debugging: detect incorrect freeing of live locks"
depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
select DEBUG_SPINLOCK
+ select DISABLE_PREEMPT_SPINLOCK_WAITERS
select DEBUG_MUTEXES
select LOCKDEP
help

2008-02-21 15:55:54

by Gregory Haskins

Subject: [PATCH [RT] 03/14] x86: FIFO ticket spinlocks

From: Nick Piggin <[email protected]>

Introduce ticket-lock spinlocks for x86, which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec-based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out
of cache, and becomes almost insignificant even on P4 when the lock misses
cache. trylock is significantly slower, but trylocks are relatively rare.

On an 8-core (2-socket) Opteron, spinlock unfairness is extremely
noticeable, with a userspace test showing a difference of up to 2x runtime
per thread, and some threads starved or "unfairly" granted the lock up to
1,000,000 (!) times. After this patch, all threads appear to finish at
exactly the same time.

The memory ordering of the lock does conform to x86 standards, and the
implementation has been reviewed by Intel and AMD engineers.

The algorithm also tells us how many CPUs are contending the lock, so
lockbreak becomes trivial and we no longer have to waste 4 bytes per
spinlock for it.

After this, we can no longer spin on any locks with preempt enabled and
cannot re-enable interrupts when spinning on an irq-safe lock, because at
that point we have already taken a ticket and we would deadlock if the
same CPU tried to take the lock again. These are questionable anyway:
if the lock happens to be taken under a preempt- or interrupt-disabled
section, then it will just have the same latency problems. The real fix is
to keep critical sections short, and to ensure locks are reasonably fair
(which this patch does).
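
As an aside for readers unfamiliar with ticket locks, the protocol the
assembly below implements can be sketched in portable C roughly as follows.
This is illustrative only - it ignores the byte-packed encoding, the
256-CPU limit, and the precise memory-ordering requirements, and it is not
the C version mentioned in the cover letter:

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint head;	/* ticket currently being served */
	atomic_uint tail;	/* next ticket to hand out */
};

static void ticket_lock(struct ticket_lock *l)
{
	/* atomically note the tail and increment it: our place in line */
	unsigned int me = atomic_fetch_add(&l->tail, 1);

	/* wait until the head reaches our ticket; grant order == arrival order */
	while (atomic_load(&l->head) != me)
		;	/* a real lock would insert a pause/cpu_relax() here */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->head, 1);	/* serve the next waiter */
}

/* The number of contenders is simply tail - head, which is what makes the
 * lockbreak/contention test trivial, as noted above. */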

Signed-off-by: Nick Piggin <[email protected]>
---

include/asm-x86/spinlock.h | 225 ++++++++++++++++++++++++++++++++++++++
include/asm-x86/spinlock_32.h | 221 -------------------------------------
include/asm-x86/spinlock_64.h | 167 ----------------------------
include/asm-x86/spinlock_types.h | 2
4 files changed, 224 insertions(+), 391 deletions(-)

diff --git a/include/asm-x86/spinlock.h b/include/asm-x86/spinlock.h
index d74d85e..72fe445 100644
--- a/include/asm-x86/spinlock.h
+++ b/include/asm-x86/spinlock.h
@@ -1,5 +1,226 @@
+#ifndef _X86_SPINLOCK_H_
+#define _X86_SPINLOCK_H_
+
+#include <asm/atomic.h>
+#include <asm/rwlock.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+#include <linux/compiler.h>
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations. There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
#ifdef CONFIG_X86_32
-# include "spinlock_32.h"
+typedef char _slock_t;
+# define LOCK_INS_DEC "decb"
+# define LOCK_INS_XCH "xchgb"
+# define LOCK_INS_MOV "movb"
+# define LOCK_INS_CMP "cmpb"
+# define LOCK_PTR_REG "a"
#else
-# include "spinlock_64.h"
+typedef int _slock_t;
+# define LOCK_INS_DEC "decl"
+# define LOCK_INS_XCH "xchgl"
+# define LOCK_INS_MOV "movl"
+# define LOCK_INS_CMP "cmpl"
+# define LOCK_PTR_REG "D"
+#endif
+
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(__raw_spinlock_t *lock)
+{
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) != (tmp & 0xff));
+}
+
+static inline int __raw_spin_is_contended(__raw_spinlock_t *lock)
+{
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
+}
+
+static inline void __raw_spin_lock(__raw_spinlock_t *lock)
+{
+ short inc = 0x0100;
+
+ /*
+ * Ticket locks are conceptually two bytes, one indicating the current
+ * head of the queue, and the other indicating the current tail. The
+ * lock is acquired by atomically noting the tail and incrementing it
+ * by one (thus adding ourself to the queue and noting our position),
+ * then waiting until the head becomes equal to the initial value
+ * of the tail.
+ *
+ * This uses a 16-bit xadd to increment the tail and also load the
+ * position of the head, which takes care of memory ordering issues
+ * and should be optimal for the uncontended case. Note the tail must
+ * be in the high byte, otherwise the 16-bit wide increment of the low
+ * byte would carry up and contaminate the high byte.
+ */
+
+ __asm__ __volatile__ (
+ LOCK_PREFIX "xaddw %w0, %1\n"
+ "1:\t"
+ "cmpb %h0, %b0\n\t"
+ "je 2f\n\t"
+ "rep ; nop\n\t"
+ "movb %1, %b0\n\t"
+ /* don't need lfence here, because loads are in-order */
+ "jmp 1b\n"
+ "2:"
+ :"+Q" (inc), "+m" (lock->slock)
+ :
+ :"memory", "cc");
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int __raw_spin_trylock(__raw_spinlock_t *lock)
+{
+ int tmp;
+ short new;
+
+ asm volatile(
+ "movw %2,%w0\n\t"
+ "cmpb %h0,%b0\n\t"
+ "jne 1f\n\t"
+ "movw %w0,%w1\n\t"
+ "incb %h1\n\t"
+ "lock ; cmpxchgw %w1,%2\n\t"
+ "1:"
+ "sete %b1\n\t"
+ "movzbl %b1,%0\n\t"
+ :"=&a" (tmp), "=Q" (new), "+m" (lock->slock)
+ :
+ : "memory", "cc");
+
+ return tmp;
+}
+
+#if defined(CONFIG_X86_32) && \
+ (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
+/*
+ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
+ * (PPro errata 66, 92)
+ */
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
+#else
+# define UNLOCK_LOCK_PREFIX
+#endif
+
+static inline void __raw_spin_unlock(__raw_spinlock_t *lock)
+{
+ __asm__ __volatile__(
+ UNLOCK_LOCK_PREFIX "incb %0"
+ :"+m" (lock->slock)
+ :
+ :"memory", "cc");
+}
+
+static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock)
+{
+ while (__raw_spin_is_locked(lock))
+ cpu_relax();
+}
+
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ *
+ * On x86, we implement read-write locks as a 32-bit counter
+ * with the high bit (sign) being the "contended" bit.
+ */
+
+/**
+ * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
+static inline int __raw_read_can_lock(__raw_rwlock_t *lock)
+{
+ return (int)(lock)->lock > 0;
+}
+
+/**
+ * write_can_lock - would write_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
+static inline int __raw_write_can_lock(__raw_rwlock_t *lock)
+{
+ return (lock)->lock == RW_LOCK_BIAS;
+}
+
+static inline void __raw_read_lock(__raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t"
+ "jns 1f\n"
+ "call __read_lock_failed\n\t"
+ "1:\n"
+ ::LOCK_PTR_REG (rw) : "memory");
+}
+
+static inline void __raw_write_lock(__raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX " subl %1,(%0)\n\t"
+ "jz 1f\n"
+ "call __write_lock_failed\n\t"
+ "1:\n"
+ ::LOCK_PTR_REG (rw), "i" (RW_LOCK_BIAS) : "memory");
+}
+
+static inline int __raw_read_trylock(__raw_rwlock_t *lock)
+{
+ atomic_t *count = (atomic_t *)lock;
+
+ atomic_dec(count);
+ if (atomic_read(count) >= 0)
+ return 1;
+ atomic_inc(count);
+ return 0;
+}
+
+static inline int __raw_write_trylock(__raw_rwlock_t *lock)
+{
+ atomic_t *count = (atomic_t *)lock;
+
+ if (atomic_sub_and_test(RW_LOCK_BIAS, count))
+ return 1;
+ atomic_add(RW_LOCK_BIAS, count);
+ return 0;
+}
+
+static inline void __raw_read_unlock(__raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory");
+}
+
+static inline void __raw_write_unlock(__raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX "addl %1, %0"
+ : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
+}
+
+#define _raw_spin_relax(lock) cpu_relax()
+#define _raw_read_relax(lock) cpu_relax()
+#define _raw_write_relax(lock) cpu_relax()
+
#endif
diff --git a/include/asm-x86/spinlock_32.h b/include/asm-x86/spinlock_32.h
deleted file mode 100644
index 0c6de90..0000000
--- a/include/asm-x86/spinlock_32.h
+++ /dev/null
@@ -1,221 +0,0 @@
-#ifndef __ASM_SPINLOCK_H
-#define __ASM_SPINLOCK_H
-
-#include <asm/atomic.h>
-#include <asm/rwlock.h>
-#include <asm/page.h>
-#include <asm/processor.h>
-#include <linux/compiler.h>
-
-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else
-#define CLI_STRING "cli"
-#define STI_STRING "sti"
-#define CLI_STI_CLOBBERS
-#define CLI_STI_INPUT_ARGS
-#endif /* CONFIG_PARAVIRT */
-
-/*
- * Your basic SMP spinlocks, allowing only a single CPU anywhere
- *
- * Simple spin lock operations. There are two variants, one clears IRQ's
- * on the local processor, one does not.
- *
- * We make no fairness assumptions. They have a cost.
- *
- * (the type definitions are in asm/spinlock_types.h)
- */
-
-static inline int __raw_spin_is_locked(__raw_spinlock_t *x)
-{
- return *(volatile signed char *)(&(x)->slock) <= 0;
-}
-
-static inline void __raw_spin_lock(__raw_spinlock_t *lock)
-{
- asm volatile("\n1:\t"
- LOCK_PREFIX " ; decb %0\n\t"
- "jns 3f\n"
- "2:\t"
- "rep;nop\n\t"
- "cmpb $0,%0\n\t"
- "jle 2b\n\t"
- "jmp 1b\n"
- "3:\n\t"
- : "+m" (lock->slock) : : "memory");
-}
-
-/*
- * It is easier for the lock validator if interrupts are not re-enabled
- * in the middle of a lock-acquire. This is a performance feature anyway
- * so we turn it off:
- *
- * NOTE: there's an irqs-on section here, which normally would have to be
- * irq-traced, but on CONFIG_TRACE_IRQFLAGS we never use this variant.
- */
-#ifndef CONFIG_PROVE_LOCKING
-static inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags)
-{
- asm volatile(
- "\n1:\t"
- LOCK_PREFIX " ; decb %[slock]\n\t"
- "jns 5f\n"
- "2:\t"
- "testl $0x200, %[flags]\n\t"
- "jz 4f\n\t"
- STI_STRING "\n"
- "3:\t"
- "rep;nop\n\t"
- "cmpb $0, %[slock]\n\t"
- "jle 3b\n\t"
- CLI_STRING "\n\t"
- "jmp 1b\n"
- "4:\t"
- "rep;nop\n\t"
- "cmpb $0, %[slock]\n\t"
- "jg 1b\n\t"
- "jmp 4b\n"
- "5:\n\t"
- : [slock] "+m" (lock->slock)
- : [flags] "r" (flags)
- CLI_STI_INPUT_ARGS
- : "memory" CLI_STI_CLOBBERS);
-}
-#endif
-
-static inline int __raw_spin_trylock(__raw_spinlock_t *lock)
-{
- char oldval;
- asm volatile(
- "xchgb %b0,%1"
- :"=q" (oldval), "+m" (lock->slock)
- :"0" (0) : "memory");
- return oldval > 0;
-}
-
-/*
- * __raw_spin_unlock based on writing $1 to the low byte.
- * This method works. Despite all the confusion.
- * (except on PPro SMP or if we are using OOSTORE, so we use xchgb there)
- * (PPro errata 66, 92)
- */
-
-#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
-
-static inline void __raw_spin_unlock(__raw_spinlock_t *lock)
-{
- asm volatile("movb $1,%0" : "+m" (lock->slock) :: "memory");
-}
-
-#else
-
-static inline void __raw_spin_unlock(__raw_spinlock_t *lock)
-{
- char oldval = 1;
-
- asm volatile("xchgb %b0, %1"
- : "=q" (oldval), "+m" (lock->slock)
- : "0" (oldval) : "memory");
-}
-
-#endif
-
-static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock)
-{
- while (__raw_spin_is_locked(lock))
- cpu_relax();
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- *
- * On x86, we implement read-write locks as a 32-bit counter
- * with the high bit (sign) being the "contended" bit.
- *
- * The inline assembly is non-obvious. Think about it.
- *
- * Changed to use the same technique as rw semaphores. See
- * semaphore.h for details. -ben
- *
- * the helpers are in arch/i386/kernel/semaphore.c
- */
-
-/**
- * read_can_lock - would read_trylock() succeed?
- * @lock: the rwlock in question.
- */
-static inline int __raw_read_can_lock(__raw_rwlock_t *x)
-{
- return (int)(x)->lock > 0;
-}
-
-/**
- * write_can_lock - would write_trylock() succeed?
- * @lock: the rwlock in question.
- */
-static inline int __raw_write_can_lock(__raw_rwlock_t *x)
-{
- return (x)->lock == RW_LOCK_BIAS;
-}
-
-static inline void __raw_read_lock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t"
- "jns 1f\n"
- "call __read_lock_failed\n\t"
- "1:\n"
- ::"a" (rw) : "memory");
-}
-
-static inline void __raw_write_lock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t"
- "jz 1f\n"
- "call __write_lock_failed\n\t"
- "1:\n"
- ::"a" (rw) : "memory");
-}
-
-static inline int __raw_read_trylock(__raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- atomic_dec(count);
- if (atomic_read(count) >= 0)
- return 1;
- atomic_inc(count);
- return 0;
-}
-
-static inline int __raw_write_trylock(__raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- if (atomic_sub_and_test(RW_LOCK_BIAS, count))
- return 1;
- atomic_add(RW_LOCK_BIAS, count);
- return 0;
-}
-
-static inline void __raw_read_unlock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory");
-}
-
-static inline void __raw_write_unlock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0"
- : "+m" (rw->lock) : : "memory");
-}
-
-#define __raw_spin_relax(lock) cpu_relax()
-#define __raw_read_relax(lock) cpu_relax()
-#define __raw_write_relax(lock) cpu_relax()
-
-#endif /* __ASM_SPINLOCK_H */
diff --git a/include/asm-x86/spinlock_64.h b/include/asm-x86/spinlock_64.h
deleted file mode 100644
index 8b76dc0..0000000
--- a/include/asm-x86/spinlock_64.h
+++ /dev/null
@@ -1,167 +0,0 @@
-#ifndef __ASM_SPINLOCK_H
-#define __ASM_SPINLOCK_H
-
-#include <asm/atomic.h>
-#include <asm/rwlock.h>
-#include <asm/page.h>
-#include <asm/processor.h>
-
-/*
- * Your basic SMP spinlocks, allowing only a single CPU anywhere
- *
- * Simple spin lock operations. There are two variants, one clears IRQ's
- * on the local processor, one does not.
- *
- * We make no fairness assumptions. They have a cost.
- *
- * (the type definitions are in asm/spinlock_types.h)
- */
-
-static inline int __raw_spin_is_locked(__raw_spinlock_t *lock)
-{
- return *(volatile signed int *)(&(lock)->slock) <= 0;
-}
-
-static inline void __raw_spin_lock(__raw_spinlock_t *lock)
-{
- asm volatile(
- "\n1:\t"
- LOCK_PREFIX " ; decl %0\n\t"
- "jns 2f\n"
- "3:\n"
- "rep;nop\n\t"
- "cmpl $0,%0\n\t"
- "jle 3b\n\t"
- "jmp 1b\n"
- "2:\t" : "=m" (lock->slock) : : "memory");
-}
-
-/*
- * Same as __raw_spin_lock, but reenable interrupts during spinning.
- */
-#ifndef CONFIG_PROVE_LOCKING
-static inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags)
-{
- asm volatile(
- "\n1:\t"
- LOCK_PREFIX " ; decl %0\n\t"
- "jns 5f\n"
- "testl $0x200, %1\n\t" /* interrupts were disabled? */
- "jz 4f\n\t"
- "sti\n"
- "3:\t"
- "rep;nop\n\t"
- "cmpl $0, %0\n\t"
- "jle 3b\n\t"
- "cli\n\t"
- "jmp 1b\n"
- "4:\t"
- "rep;nop\n\t"
- "cmpl $0, %0\n\t"
- "jg 1b\n\t"
- "jmp 4b\n"
- "5:\n\t"
- : "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
-}
-#endif
-
-static inline int __raw_spin_trylock(__raw_spinlock_t *lock)
-{
- int oldval;
-
- asm volatile(
- "xchgl %0,%1"
- :"=q" (oldval), "=m" (lock->slock)
- :"0" (0) : "memory");
-
- return oldval > 0;
-}
-
-static inline void __raw_spin_unlock(__raw_spinlock_t *lock)
-{
- asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
-}
-
-static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock)
-{
- while (__raw_spin_is_locked(lock))
- cpu_relax();
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- *
- * On x86, we implement read-write locks as a 32-bit counter
- * with the high bit (sign) being the "contended" bit.
- */
-
-static inline int __raw_read_can_lock(__raw_rwlock_t *lock)
-{
- return (int)(lock)->lock > 0;
-}
-
-static inline int __raw_write_can_lock(__raw_rwlock_t *lock)
-{
- return (lock)->lock == RW_LOCK_BIAS;
-}
-
-static inline void __raw_read_lock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
- "jns 1f\n"
- "call __read_lock_failed\n"
- "1:\n"
- ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory");
-}
-
-static inline void __raw_write_lock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
- "jz 1f\n"
- "\tcall __write_lock_failed\n\t"
- "1:\n"
- ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory");
-}
-
-static inline int __raw_read_trylock(__raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- atomic_dec(count);
- if (atomic_read(count) >= 0)
- return 1;
- atomic_inc(count);
- return 0;
-}
-
-static inline int __raw_write_trylock(__raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- if (atomic_sub_and_test(RW_LOCK_BIAS, count))
- return 1;
- atomic_add(RW_LOCK_BIAS, count);
- return 0;
-}
-
-static inline void __raw_read_unlock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX " ; incl %0" :"=m" (rw->lock) : : "memory");
-}
-
-static inline void __raw_write_unlock(__raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX " ; addl $" RW_LOCK_BIAS_STR ",%0"
- : "=m" (rw->lock) : : "memory");
-}
-
-#define __raw_spin_relax(lock) cpu_relax()
-#define __raw_read_relax(lock) cpu_relax()
-#define __raw_write_relax(lock) cpu_relax()
-
-#endif /* __ASM_SPINLOCK_H */
diff --git a/include/asm-x86/spinlock_types.h b/include/asm-x86/spinlock_types.h
index 8b0ccbc..e7588d4 100644
--- a/include/asm-x86/spinlock_types.h
+++ b/include/asm-x86/spinlock_types.h
@@ -9,7 +9,7 @@ typedef struct {
unsigned int slock;
} __raw_spinlock_t;

-#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED { 0 }

typedef struct {
unsigned int lock;

2008-02-21 15:56:25

by Gregory Haskins

Subject: [PATCH [RT] 04/14] disable PREEMPT_SPINLOCK_WAITERS when x86 ticket/fifo spins are in use

Preemptible spinlock waiters effectively bypass the benefits of a FIFO
spinlock. Since we now have FIFO spinlocks enabled for x86, disable the
preemption feature on x86.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Nick Piggin <[email protected]>
---

arch/x86/Kconfig | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8d15667..d5b9a67 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -20,6 +20,7 @@ config X86
bool
default y
select HAVE_MCOUNT
+ select DISABLE_PREEMPT_SPINLOCK_WAITERS

config GENERIC_TIME
bool

2008-02-21 15:56:46

by Gregory Haskins

Subject: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep

The current logic makes rather coarse adjustments to current->state, since
it is planning on sleeping anyway. We want to eventually move to an
adaptive (e.g. optional-sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule(). This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch so that the smaller change is easier
to review.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 20 +++++++++++++++-----
1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..15fc6e6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -661,6 +661,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
}

+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+ unsigned long state = xchg(&current->state, new_state);
+ if (unlikely(state == TASK_RUNNING))
+ *saved_state = TASK_RUNNING;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -700,7 +708,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* saved_state accordingly. If we did not get a real wakeup
* then we return with the saved state.
*/
- saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+ saved_state = current->state;
+ smp_mb();

for (;;) {
unsigned long saved_flags;
@@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)

debug_rt_mutex_print_deadlock(&waiter);

- schedule_rt_mutex(lock);
+ update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+ if (waiter.task)
+ schedule_rt_mutex(lock);
+ else
+ update_current(TASK_RUNNING_MUTEX, &saved_state);

spin_lock_irqsave(&lock->wait_lock, flags);
current->flags |= saved_flags;
current->lock_depth = saved_lock_depth;
- state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
- if (unlikely(state == TASK_RUNNING))
- saved_state = TASK_RUNNING;
}

state = xchg(&current->state, saved_state);

2008-02-21 15:57:21

by Gregory Haskins

Subject: [PATCH [RT] 06/14] optimize rt lock wakeup

It is redundant to wake the grantee task if it is already running.

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins <[email protected]>
Signed-off-by: Peter Morreale <[email protected]>
---

kernel/rtmutex.c | 23 ++++++++++++++++++-----
1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 15fc6e6..cb27b08 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -511,6 +511,24 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
pendowner = waiter->task;
waiter->task = NULL;

+ /*
+ * Do the wakeup before the ownership change to give any spinning
+ * waiter grantees a headstart over the other threads that will
+ * trigger once owner changes.
+ */
+ if (!savestate)
+ wake_up_process(pendowner);
+ else {
+ smp_mb();
+ /*
+ * This may appear to be a race, but the barriers close the
+ * window.
+ */
+ if ((pendowner->state != TASK_RUNNING)
+ && (pendowner->state != TASK_RUNNING_MUTEX))
+ wake_up_process_mutex(pendowner);
+ }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);

spin_unlock(&current->pi_lock);
@@ -537,11 +555,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
}
spin_unlock(&pendowner->pi_lock);
-
- if (savestate)
- wake_up_process_mutex(pendowner);
- else
- wake_up_process(pendowner);
}

/*

2008-02-21 15:57:54

by Gregory Haskins

Subject: [PATCH [RT] 07/14] adaptive real-time lock support

There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs. sleeping). Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short-hold locks. However, they also have the
disadvantage of creating indeterminate latencies, since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex-derived
locks. This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance, since the locks will now sleep under
contention. Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent. Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem. They
spin when possible, and sleep when necessary (to avoid deadlock, etc.).
This significantly improves performance in many areas of the -rt kernel.

Signed-off-by: Gregory Haskins <[email protected]>
Signed-off-by: Peter Morreale <[email protected]>
Signed-off-by: Sven Dietrich <[email protected]>
---

kernel/Kconfig.preempt | 20 +++++++
kernel/rtmutex.c | 19 +++++-
kernel/rtmutex_adaptive.h | 134 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 168 insertions(+), 5 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 5b45213..6568519 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,6 +192,26 @@ config RCU_TRACE
Say Y/M here if you want to enable RCU tracing in-kernel/module.
Say N if you are unsure.

+config ADAPTIVE_RTLOCK
+ bool "Adaptive real-time locks"
+ default y
+ depends on PREEMPT_RT && SMP
+ help
+ PREEMPT_RT allows for greater determinism by transparently
+ converting normal spinlock_ts into preemptible rtmutexes which
+ sleep any waiters under contention. However, in many cases the
+ lock will be released in less time than it takes to context
+ switch. Therefore, the "sleep under contention" policy may also
+ degrade throughput performance due to the extra context switches.
+
+ This option alters the rtmutex derived spinlock_t replacement
+ code to use an adaptive spin/sleep algorithm. It will spin
+ unless it determines it must sleep to avoid deadlock. This
+ offers a best of both worlds solution since we achieve both
+ high-throughput and low-latency.
+
+ If unsure, say Y
+
config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cb27b08..feb938f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,7 @@
* Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[email protected]>
* Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
* Copyright (C) 2006 Esben Nielsen
+ * Copyright (C) 2008 Novell, Inc.
*
* See Documentation/rt-mutex-design.txt for details.
*/
@@ -17,6 +18,7 @@
#include <linux/hardirq.h>

#include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"

/*
* lock->owner state tracking:
@@ -697,6 +699,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
{
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+ DECLARE_ADAPTIVE_WAITER(adaptive);

debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -743,6 +746,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}

+ prepare_adaptive_wait(lock, &adaptive);
+
/*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
@@ -754,11 +759,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)

debug_rt_mutex_print_deadlock(&waiter);

- update_current(TASK_UNINTERRUPTIBLE, &saved_state);
- if (waiter.task)
- schedule_rt_mutex(lock);
- else
- update_current(TASK_RUNNING_MUTEX, &saved_state);
+ /* adaptive_wait() returns 1 if we need to sleep */
+ if (adaptive_wait(lock, &waiter, &adaptive)) {
+ update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+ if (waiter.task)
+ schedule_rt_mutex(lock);
+ else
+ update_current(TASK_RUNNING_MUTEX,
+ &saved_state);
+ }

spin_lock_irqsave(&lock->wait_lock, flags);
current->flags |= saved_flags;
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
new file mode 100644
index 0000000..505fed5
--- /dev/null
+++ b/kernel/rtmutex_adaptive.h
@@ -0,0 +1,134 @@
+/*
+ * Adaptive RT lock support
+ *
+ * There are pros and cons when deciding between the two basic forms of
+ * locking primitives (spinning vs sleeping). Without going into great
+ * detail on either one, we note that spinlocks have the advantage of
+ * lower overhead for short hold locks. However, they also have a
+ * con in that they create indeterminate latencies since preemption
+ * must traditionally be disabled while the lock is held (to prevent deadlock).
+ *
+ * We want to avoid non-deterministic critical sections in -rt. Therefore,
+ * when realtime is enabled, most contexts are converted to threads, and
+ * likewise most spinlock_ts are converted to sleepable rt-mutex derived
+ * locks. This allows the holder of the lock to remain fully preemptible,
+ * thus reducing a major source of latencies in the kernel.
+ *
+ * However, converting what was once a true spinlock into a sleeping lock
+ * may also decrease performance since the locks will now sleep under
+ * contention. Since the fundamental lock used to be a spinlock, it is
+ * highly likely that it was used in a short-hold path and that release
+ * is imminent. Therefore sleeping only serves to cause context-thrashing.
+ *
+ * Adaptive RT locks use a hybrid approach to solve the problem. They
+ * spin when possible, and sleep when necessary (to avoid deadlock, etc).
+ * This significantly improves many areas of the performance of the -rt
+ * kernel.
+ *
+ * Copyright (C) 2008 Novell, Inc.,
+ * Sven Dietrich, Peter Morreale, and Gregory Haskins
+ *
+ */
+
+#ifndef __KERNEL_RTMUTEX_ADAPTIVE_H
+#define __KERNEL_RTMUTEX_ADAPTIVE_H
+
+#include "rtmutex_common.h"
+
+
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+struct adaptive_waiter {
+ struct task_struct *owner;
+};
+
+/*
+ * Adaptive-rtlocks will busywait when possible, and sleep only if
+ * necessary. Note that the busyloop looks racy, and it is....but we do
+ * not care. If we lose any races it simply means that we spin one more
+ * time before seeing that we need to break-out on the next iteration.
+ *
+ * We realize this is a relatively large function to inline, but note that
+ * it is only instantiated 1 or 2 times max, and it makes a measurable
+ * performance different to avoid the call.
+ *
+ * Returns 1 if we should sleep
+ *
+ */
+static inline int
+adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
+ struct adaptive_waiter *adaptive)
+{
+ int sleep = 0;
+
+ for (;;) {
+ /*
+ * If the task was re-awoken, break out completely so we can
+ * reloop through the lock-acquisition code.
+ */
+ if (!waiter->task)
+ break;
+
+ /*
+ * We need to break if the owner changed so we can reloop
+ * and safely acquire the owner-pointer again with the
+ * wait_lock held.
+ */
+ if (adaptive->owner != rt_mutex_owner(lock))
+ break;
+
+ /*
+ * If we got here, presumably the lock ownership is still
+ * current. We will use it to our advantage to be able to
+ * spin without disabling preemption...
+ */
+
+ /*
+ * .. sleep if the owner is not running..
+ */
+ if (!adaptive->owner->se.on_rq) {
+ sleep = 1;
+ break;
+ }
+
+ /*
+ * .. or is running on our own cpu (to prevent deadlock)
+ */
+ if (task_cpu(adaptive->owner) == task_cpu(current)) {
+ sleep = 1;
+ break;
+ }
+
+ cpu_relax();
+ }
+
+ put_task_struct(adaptive->owner);
+
+ return sleep;
+}
+
+static inline void
+prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
+{
+ /*
+ * We must acquire/lock the owner pointer while holding
+ * the wait_lock, or we risk racing against the owner
+ * exiting.
+ */
+ adaptive->owner = rt_mutex_owner(lock);
+ get_task_struct(adaptive->owner);
+}
+
+#define DECLARE_ADAPTIVE_WAITER(name) \
+ struct adaptive_waiter name = { .owner = NULL, }
+
+#else
+
+#define DECLARE_ADAPTIVE_WAITER(name)
+
+#define adaptive_wait(lock, waiter, busy) 1
+#define prepare_adaptive_wait(lock, busy) {}
+
+#endif /* CONFIG_ADAPTIVE_RTLOCK */
+
+
+#endif /* __KERNEL_RTMUTEX_ADAPTIVE_H */

2008-02-21 15:58:36

by Gregory Haskins

Subject: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

From: Sven Dietrich <[email protected]>

Signed-off-by: Sven Dietrich <[email protected]>
---

kernel/Kconfig.preempt | 11 +++++++++++
kernel/rtmutex.c | 4 ++++
kernel/rtmutex_adaptive.h | 11 +++++++++--
kernel/sysctl.c | 12 ++++++++++++
4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK

If unsure, say Y

+config RTLOCK_DELAY
+ int "Default delay (in loops) for adaptive rtlocks"
+ range 0 1000000000
+ depends on ADAPTIVE_RTLOCK
+ default "10000"
+ help
+ This allows you to specify the maximum attempts a task will spin
+ attempting to acquire an rtlock before sleeping. The value is
+ tunable at runtime via a sysctl. A setting of 0 (zero) disables
+ the adaptive algorithm entirely.
+
config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
#include "rtmutex_common.h"
#include "rtmutex_adaptive.h"

+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
/*
* lock->owner state tracking:
*
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
#ifdef CONFIG_ADAPTIVE_RTLOCK
struct adaptive_waiter {
struct task_struct *owner;
+ int timeout;
};

/*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
{
int sleep = 0;

- for (;;) {
+ for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
* If the task was re-awoken, break out completely so we can
* reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
cpu_relax();
}

+ if (adaptive->timeout <= 0)
+ sleep = 1;
+
put_task_struct(adaptive->owner);

return sleep;
@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
}

+extern int rtlock_timeout;
+
#define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL, \
+ .timeout = rtlock_timeout, }

#else

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
#include <asm/stacktrace.h>
#endif

+#include "rtmutex_adaptive.h"
+
static int deprecated_sysctl_warning(struct __sysctl_args *args);

#if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rtlock_timeout",
+ .data = &rtlock_timeout,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_PROC_FS
{
.ctl_name = CTL_UNNUMBERED,

2008-02-21 15:59:00

by Gregory Haskins

Subject: [PATCH [RT] 09/14] adaptive mutexes

From: Peter W. Morreale <[email protected]>

This patch adds the adaptive spin-lock busy-wait to rtmutexes. It adds
a new tunable, rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale <[email protected]>
---

kernel/Kconfig.preempt | 37 +++++++++++++++++++++++++++++++++++++
kernel/rtmutex.c | 44 ++++++++++++++++++++++++++------------------
kernel/rtmutex_adaptive.h | 32 ++++++++++++++++++++++++++++++--
kernel/sysctl.c | 10 ++++++++++
4 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index eebec19..d2b0daa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -223,6 +223,43 @@ config RTLOCK_DELAY
tunable at runtime via a sysctl. A setting of 0 (zero) disables
the adaptive algorithm entirely.

+config ADAPTIVE_RTMUTEX
+ bool "Adaptive real-time mutexes"
+ default y
+ depends on ADAPTIVE_RTLOCK
+ help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes. In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidance.
+
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation. In the event the critical
+ section does not need to sleep, an additional gain in throughput
+ can be seen by avoiding the extra overhead of sleeping.
+
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm. It will spin unless it determines it must
+ sleep to avoid deadlock. This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+
+ If unsure, say Y
+
+config RTMUTEX_DELAY
+ int "Default delay (in loops) for adaptive mutexes"
+ range 0 10000000
+ depends on ADAPTIVE_RTMUTEX
+ default "3000"
+ help
+ This allows you to specify the maximum delay a task will use
+ to wait for an rt mutex before going to sleep. Note that
+ although the delay is implemented as a preemptible loop, tasks
+ of like priority cannot preempt each other and this setting can
+ result in increased latencies.
+
+ The value is tunable at runtime via a sysctl. A setting of 0
+ (zero) disables the adaptive algorithm entirely.
+
config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a7423f..a7ed7b2 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -24,6 +24,10 @@
int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
#endif

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
/*
* lock->owner state tracking:
*
@@ -521,17 +525,16 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
* Do the wakeup before the ownership change to give any spinning
* waiter grantees a headstart over the other threads that will
* trigger once owner changes.
+ *
+ * This may appear to be a race, but the barriers close the
+ * window.
*/
- if (!savestate)
- wake_up_process(pendowner);
- else {
- smp_mb();
- /*
- * This may appear to be a race, but the barriers close the
- * window.
- */
- if ((pendowner->state != TASK_RUNNING)
- && (pendowner->state != TASK_RUNNING_MUTEX))
+ smp_mb();
+ if ((pendowner->state != TASK_RUNNING)
+ && (pendowner->state != TASK_RUNNING_MUTEX)) {
+ if (!savestate)
+ wake_up_process(pendowner);
+ else
wake_up_process_mutex(pendowner);
}

@@ -764,7 +767,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(&waiter);

/* adaptive_wait() returns 1 if we need to sleep */
- if (adaptive_wait(lock, &waiter, &adaptive)) {
+ if (adaptive_wait(lock, 0, &waiter, &adaptive)) {
update_current(TASK_UNINTERRUPTIBLE, &saved_state);
if (waiter.task)
schedule_rt_mutex(lock);
@@ -975,6 +978,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
int ret = 0, saved_lock_depth = -1;
struct rt_mutex_waiter waiter;
unsigned long flags;
+ DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive);

debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -995,8 +999,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(current->lock_depth >= 0))
saved_lock_depth = rt_release_bkl(lock, flags);

- set_current_state(state);
-
/* Setup the timer, when timeout != NULL */
if (unlikely(timeout))
hrtimer_start(&timeout->timer, timeout->timer.expires,
@@ -1049,6 +1051,9 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(ret))
break;
}
+
+ mutex_prepare_adaptive_wait(lock, &adaptive);
+
saved_flags = current->flags & PF_NOSCHED;
current->flags &= ~PF_NOSCHED;

@@ -1056,17 +1061,20 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,

debug_rt_mutex_print_deadlock(&waiter);

- if (waiter.task)
- schedule_rt_mutex(lock);
+ if (mutex_adaptive_wait(lock,
+ (state == TASK_INTERRUPTIBLE),
+ &waiter, &adaptive)) {
+ set_current_state(state);
+ if (waiter.task)
+ schedule_rt_mutex(lock);
+ set_current_state(TASK_RUNNING);
+ }

spin_lock_irq(&lock->wait_lock);

current->flags |= saved_flags;
- set_current_state(state);
}

- set_current_state(TASK_RUNNING);
-
if (unlikely(waiter.task))
remove_waiter(lock, &waiter, flags);

diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index b7e282b..72f8def 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -56,7 +56,8 @@ struct adaptive_waiter {
*
*/
static inline int
-adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
+adaptive_wait(struct rt_mutex *lock, int interruptible,
+ struct rt_mutex_waiter *waiter,
struct adaptive_waiter *adaptive)
{
int sleep = 0;
@@ -77,6 +78,14 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
if (adaptive->owner != rt_mutex_owner(lock))
break;

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+ /*
+ * Mutexes may need to check for signals...
+ */
+ if (interruptible && signal_pending(current))
+ break;
+#endif
+
/*
* If we got here, presumably the lock ownership is still
* current. We will use it to our advantage to be able to
@@ -132,10 +141,29 @@ extern int rtlock_timeout;

#define DECLARE_ADAPTIVE_WAITER(name)

-#define adaptive_wait(lock, waiter, busy) 1
+#define adaptive_wait(lock, intr, waiter, busy) 1
#define prepare_adaptive_wait(lock, busy) {}

#endif /* CONFIG_ADAPTIVE_RTLOCK */

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+
+#define mutex_adaptive_wait adaptive_wait
+#define mutex_prepare_adaptive_wait prepare_adaptive_wait
+
+extern int rtmutex_timeout;
+
+#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
+ struct adaptive_waiter name = { .owner = NULL, \
+ .timeout = rtmutex_timeout, }
+
+#else
+
+#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
+
+#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
+#define mutex_prepare_adaptive_wait(lock, busy) {}
+
+#endif /* CONFIG_ADAPTIVE_RTMUTEX */

#endif /* __KERNEL_RTMUTEX_ADAPTIVE_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 36259e4..3465af2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -976,6 +976,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rtmutex_timeout",
+ .data = &rtmutex_timeout,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_PROC_FS
{
.ctl_name = CTL_UNNUMBERED,

2008-02-21 15:59:38

by Gregory Haskins

Subject: [PATCH [RT] 10/14] adjust pi_lock usage in wakeup

From: Peter W. Morreale <[email protected]>

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner. We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale <[email protected]>
---

kernel/rtmutex.c | 14 +++++++++-----
1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a7ed7b2..122f143 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -505,6 +505,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+ struct rt_mutex_waiter *next;

spin_lock(&current->pi_lock);

@@ -549,6 +550,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
* waiter with higher priority than pending-owner->normal_prio
* is blocked on the unboosted (pending) owner.
*/
+
+ if (rt_mutex_has_waiters(lock))
+ next = rt_mutex_top_waiter(lock);
+ else
+ next = NULL;
+
spin_lock(&pendowner->pi_lock);

WARN_ON(!pendowner->pi_blocked_on);
@@ -557,12 +564,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)

pendowner->pi_blocked_on = NULL;

- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
+ if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- }
+
spin_unlock(&pendowner->pi_lock);
}

2008-02-21 16:00:08

by Gregory Haskins

Subject: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

Decorate the printk path with an "unlikely()".

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 122f143..ebdaa17 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
{
/* Temporary HACK! */
- if (!current->in_printk)
- might_sleep();
- else if (in_atomic() || irqs_disabled())
+ if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;

+ might_sleep();
+
if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
rt_mutex_deadlock_account_lock(lock, current);
else
@@ -677,7 +677,7 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
{
/* Temporary HACK! */
- if (current->in_printk && (in_atomic() || irqs_disabled()))
+ if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;

2008-02-21 16:00:49

by Gregory Haskins

Subject: [PATCH [RT] 12/14] remove the extra call to try_to_take_lock

From: Peter W. Morreale <[email protected]>

Remove the redundant attempt to get the lock. While it is true that with
this patch the exit path adds an unnecessary xchg (in the event the lock
is granted without further traversal in the loop), experimentation shows
that we almost never encounter this situation.

Signed-off-by: Peter W. Morreale <[email protected]>
---

kernel/rtmutex.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ebdaa17..95c3644 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -718,12 +718,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

- /* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- return;
- }
-
BUG_ON(rt_mutex_owner(lock) == current);

/*

2008-02-21 16:01:16

by Gregory Haskins

Subject: [PATCH [RT] 13/14] allow rt-mutex lock-stealing to include lateral priority

The current logic only allows lock stealing to occur if the current task
is of higher priority than the pending owner. We can gain significant
throughput improvements (200%+) by allowing the lock-stealing code to
include tasks of equal priority. The theory is that the system will make
faster progress by allowing the task already on the CPU to take the lock
rather than waiting for the system to wake up a different task.

This does add a degree of unfairness, yes. But also note that the users
of these locks under non-rt environments have already been using unfair
raw spinlocks anyway, so the tradeoff is probably worth it.

The way I like to think of this is that higher-priority tasks should
clearly preempt, and lower-priority tasks should clearly block. However,
if tasks have an identical priority value, then we can think of the
scheduler's decisions as the tie-breaking parameter. (E.g. tasks that the
scheduler picked to run first have a logically higher priority among tasks
of the same prio.) This helps to keep the system "primed" with tasks doing
useful work, and the end result is higher throughput.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/Kconfig.preempt | 10 ++++++++++
kernel/rtmutex.c | 31 +++++++++++++++++++++++--------
2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2b0daa..343b93c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -273,3 +273,13 @@ config SPINLOCK_BKL
Say Y here if you are building a kernel for a desktop system.
Say N if you are unsure.

+config RTLOCK_LATERAL_STEAL
+ bool "Allow equal-priority rtlock stealing"
+ default y
+ depends on PREEMPT_RT
+ help
+ This option alters the rtlock lock-stealing logic to allow
+ equal priority tasks to preempt a pending owner in addition
+ to higher priority tasks. This allows for a significant
+ boost in throughput under certain circumstances at the expense
+ of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 95c3644..da077e5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -323,12 +323,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
return ret;
}

+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+ if (current->prio >= pendowner->prio)
+#else
+ if (current->prio > pendowner->prio)
+ return 0;
+
+ if (!unfair && (current->prio == pendowner->prio))
+#endif
+ return 0;
+
+ return 1;
+}
+
/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
*/
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -340,7 +355,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
return 1;

spin_lock(&pendowner->pi_lock);
- if (current->prio >= pendowner->prio) {
+ if (!lock_is_stealable(pendowner, unfair)) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
@@ -393,7 +408,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
*
* Must be called with lock->wait_lock held.
*/
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
{
/*
* We have to be careful here if the atomic speedups are
@@ -416,7 +431,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
mark_rt_mutex_waiters(lock);

- if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+ if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
return 0;

/* We got the lock. */
@@ -737,7 +752,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
int saved_lock_depth = current->lock_depth;

/* Try to acquire the lock */
- if (try_to_take_rt_mutex(lock))
+ if (try_to_take_rt_mutex(lock, 1))
break;
/*
* waiter.task is NULL the first time we come here and
@@ -985,7 +1000,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
init_lists(lock);

/* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
+ if (try_to_take_rt_mutex(lock, 0)) {
spin_unlock_irqrestore(&lock->wait_lock, flags);
return 0;
}
@@ -1006,7 +1021,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long saved_flags;

/* Try to acquire the lock: */
- if (try_to_take_rt_mutex(lock))
+ if (try_to_take_rt_mutex(lock, 0))
break;

/*
@@ -1120,7 +1135,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lock)

init_lists(lock);

- ret = try_to_take_rt_mutex(lock);
+ ret = try_to_take_rt_mutex(lock, 0);
/*
* try_to_take_rt_mutex() sets the lock waiters
* bit unconditionally. Clean this up.

2008-02-21 16:01:43

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH [RT] 14/14] sysctl for runtime-control of lateral mutex stealing

From: Sven-Thorsten Dietrich <[email protected]>

Add /proc/sys/kernel/rtmutex_lateral_steal to allow switching
equal-priority mutex stealing between threads on and off at runtime.

Signed-off-by: Sven-Thorsten Dietrich <[email protected]>
---

kernel/rtmutex.c | 8 ++++++--
kernel/sysctl.c | 14 ++++++++++++++
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index da077e5..62e7af5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -27,6 +27,9 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
#ifdef CONFIG_ADAPTIVE_RTMUTEX
int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
#endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif

/*
* lock->owner state tracking:
@@ -331,7 +334,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;

- if (!unfair && (current->prio == pendowner->prio))
+ if (unlikely(current->prio == pendowner->prio) &&
+ !(unfair && rtmutex_lateral_steal))
#endif
return 0;

@@ -355,7 +359,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
return 1;

spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, unfair)) {
+ if (!lock_is_stealable(pendowner, (unfair & rtmutex_lateral_steal))) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3465af2..c1a1c6d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -179,6 +179,10 @@ extern struct ctl_table inotify_table[];
int sysctl_legacy_va_layout;
#endif

+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
extern int prove_locking;
extern int lock_stat;

@@ -986,6 +990,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rtmutex_lateral_steal",
+ .data = &rtmutex_lateral_steal,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_PROC_FS
{
.ctl_name = CTL_UNNUMBERED,

2008-02-21 16:13:41

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

>>> On Thu, Feb 21, 2008 at 10:26 AM, in message
<[email protected]>, Gregory Haskins
<[email protected]> wrote:

> We have put together some data from different types of benchmarks for
> this patch series, which you can find here:
>
> ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also placed a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg

2008-02-21 16:42:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

On Thursday 21 February 2008 16:27:22 Gregory Haskins wrote:

> @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
> void fastcall (*slowfn)(struct rt_mutex *lock))
> {
> /* Temporary HACK! */
> - if (!current->in_printk)
> - might_sleep();
> - else if (in_atomic() || irqs_disabled())
> + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))

I have my doubts that gcc will honor unlikelies that don't affect
the complete condition of an if.

Also, conditions guarding returns are predicted unlikely by default
anyway, AFAIK.

The patch is likely a nop.

-Andi

2008-02-21 16:42:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism


> +config RTLOCK_DELAY
> + int "Default delay (in loops) for adaptive rtlocks"
> + range 0 1000000000
> + depends on ADAPTIVE_RTLOCK

I must say I'm not a big fan of putting such subtle configurable numbers
into Kconfig. Compilation is usually the wrong place to configure
such a thing. Just having it as a sysctl only should be good enough.

> + default "10000"

Perhaps you can expand on how you came up with that default number?
It looks suspiciously round, and worse, the actual spin time depends a lot on the
CPU frequency (so e.g. a 3GHz CPU will likely behave quite
differently from a 2GHz CPU). Did you experiment with other spin times?
Should it be scaled with the number of CPUs? And at what point is
real-time behaviour visibly impacted?

Most likely it would be better to switch to something that is more
absolute time, like checking RDTSC every few iteration similar to what
udelay does. That would be at least constant time.
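
A sketch of the kind of bounded poll being suggested (illustrative only;
owner_is_running() is a made-up stand-in for whatever break condition the
adaptive loop actually uses, and nothing below is from the posted series):

#include <linux/rtmutex.h>
#include <linux/sched.h>	/* cpu_relax() */
#include <asm/timex.h>		/* get_cycles(), cycles_t */

static int poll_with_cycle_budget(struct rt_mutex *lock, cycles_t budget)
{
	cycles_t start = get_cycles();
	unsigned int i = 0;

	while (owner_is_running(lock)) {	/* hypothetical helper */
		/* only read the clock every 64 iterations, udelay-style */
		if (!(++i & 63) && (get_cycles() - start) > budget)
			return 0;		/* budget exhausted: sleep */
		cpu_relax();
	}
	return 1;				/* owner changed: retry lock */
}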

-Andi

2008-02-21 16:49:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH [RT] 10/14] adjust pi_lock usage in wakeup


On Thu, 21 Feb 2008, Gregory Haskins wrote:

> From: Peter W.Morreale <[email protected]>
>
> In wakeup_next_waiter(), we take the pi_lock, and then find out whether
> we have another waiter to add to the pending owner. We can reduce
> contention on the pi_lock for the pending owner if we first obtain the
> pointer to the next waiter outside of the pi_lock.
>
> This patch adds a measureable increase in throughput.

I see how this may decrease contention (slightly less time in holding the
pi_lock). But, please, when stating something like: "adds a measurable
increase in throughput", show the benchmark numbers.

-- Steve

2008-02-21 16:54:30

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

>>> On Thu, Feb 21, 2008 at 11:36 AM, in message <[email protected]>,
Andi Kleen <[email protected]> wrote:
> On Thursday 21 February 2008 16:27:22 Gregory Haskins wrote:
>
>> @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
>> void fastcall (*slowfn)(struct rt_mutex *lock))
>> {
>> /* Temporary HACK! */
>> - if (!current->in_printk)
>> - might_sleep();
>> - else if (in_atomic() || irqs_disabled())
>> + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
>
> I have my doubts that gcc will honor unlikelies that don't affect
> the complete condition of an if.
>
> Also conditions guarding returns are by default predicted unlikely
> anyways AFAIK.
>
> The patch is likely a nop.
>

Yeah, you are probably right. We have found that the system is *extremely* touchy about how much overhead we have in the lock-acquisition path. For instance, using a non-inline version of adaptive_wait() can cost 5-10% in disk-io throughput. So we were trying to find places to shave anywhere we could. That being said, I didn't record any difference from this patch, so you are probably exactly right. It just seemed like "the right thing to do," so I left it in.

-Greg


2008-02-21 17:09:23

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

>>> On Thu, Feb 21, 2008 at 11:41 AM, in message <[email protected]>,
Andi Kleen <[email protected]> wrote:

>> +config RTLOCK_DELAY
>> + int "Default delay (in loops) for adaptive rtlocks"
>> + range 0 1000000000
>> + depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
>> + default "10000"
>
> Perhaps you can expand how you came up with that default number?

Actually, the number doesn't seem to matter that much as long as it is sufficiently long to make timeouts rare. Most workloads will present some threshold for hold-time. You generally get the best performance if the value is at least as "long" as that threshold. Anything beyond that and there is no gain, but there doesn't appear to be a penalty either. So we picked 10000 because we found it to fit that criterion quite well for our range of GHz-class x86 machines. YMMV, but that is why it's configurable ;)

> It looks suspiciously round and worse the actual spin time depends a lot on
> the
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite
> differently from a 2Ghz CPU)

Yeah, fully agree. We really wanted to use a time-value here but ran into various problems that have yet to be resolved. We have it on the todo list to express this in terms of ns so it will at least scale with the architecture.

> Did you experiment with other spin times?

Of course ;)

> Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that as a research "todo".

> And at what point is real
> time behaviour visibly impacted?

Well, if we did our jobs correctly, RT behavior should *never* be impacted. *Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier. There is that threshold that affects the probability of timing out. Values lower than that threshold start to degrade throughput. Values higher than that have no effect on throughput, but may drive the cpu utilization higher, which can theoretically impact tasks of equal or lesser priority by taking that resource away from them. To date, we have not observed any real-world implications of this, however.

>
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iteration similar to what
> udelay does. That would be at least constant time.

I agree. We need to move in the direction of a time basis. The tradeoff is that it needs to be portable and low-impact (e.g. ktime_get() is too heavy-weight). I think one of the (not-included) patches converts a nanosecond value from the sysctl to approximate loop counts using the bogomips data. This was a decent compromise between the non-scaling loop counts and the heavy-weight official timing APIs. We dropped it because we support older kernels which conflicted with the patch. We may have to resurrect it, however.
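
To sketch what that conversion looks like (my approximation of the idea,
not the dropped patch itself):

#include <linux/delay.h>	/* loops_per_jiffy */
#include <linux/jiffies.h>	/* HZ */
#include <linux/time.h>		/* NSEC_PER_SEC */
#include <linux/types.h>
#include <asm/div64.h>		/* do_div() */

/* convert a sysctl value in nanoseconds to a spin-loop count, roughly
 * the way udelay() scales by the bogomips data */
static unsigned long ns_to_spin_loops(unsigned long ns)
{
	u64 loops = (u64)ns * loops_per_jiffy * HZ;

	do_div(loops, NSEC_PER_SEC);
	return (unsigned long)loops;
}

The conversion could be done once when the sysctl is written, so the spin
loop itself would still only count iterations.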

-Greg


2008-02-21 17:09:43

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH [RT] 10/14] adjust pi_lock usage in wakeup

On Thu, 2008-02-21 at 11:48 -0500, Steven Rostedt wrote:
> On Thu, 21 Feb 2008, Gregory Haskins wrote:
>
> > From: Peter W.Morreale <[email protected]>
> >
> > In wakeup_next_waiter(), we take the pi_lock, and then find out whether
> > we have another waiter to add to the pending owner. We can reduce
> > contention on the pi_lock for the pending owner if we first obtain the
> > pointer to the next waiter outside of the pi_lock.
> >
> > This patch adds a measureable increase in throughput.
>
> I see how this may decrease contention (slightly less time in holding the
> pi_lock). But, please, when stating something like: "adds a measurable
> increase in throughput", show the benchmark numbers.
>
> -- Steve
>

Approximately 3% on the dbench benchmark we used.

My "standard" sanity check was to mount a ramfs filesystem and execute:

dbench -t 10 30

five times and generate an average from the reported "Throughput"
numbers displayed by the runs.

dbench was chosen merely because of the contention on dcache_lock.

Best,
-PWM

2008-02-21 17:25:08

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > + int "Default delay (in loops) for adaptive rtlocks"
> > + range 0 1000000000
> > + depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
> > + default "10000"
>
> Perhaps you can expand how you came up with that default number?
> It looks suspiciously round and worse the actual spin time depends a lot on the
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite
> differently from a 2Ghz CPU) Did you experiment with other spin times?
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted?
>

We did play with a number of different values at first, however this
needs more inspection. With the loads we have played with, this
parameter is not that sensitive.

At first we only had the adaptive wait added to the rt spin lock code,
and the values chosen did not seem to have much of an impact until we
dropped below 1000. Higher values had little impact. Adding the
adaptive code to rt mutexes seems to make this (and the similar mutex
parameter) more sensitive, however this is likely due to the particular
benchmarks we were using.

Note that the adaptive wait loop breaks on every owner change, so the
actual time spent spinning on a CPU (in that loop) is, in general,
limited to the time the lock is held. (Note the other cases in that
adaptive wait loop as well.)
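
For readers without the patches in front of them, the loop shape being
described is roughly the following (a simplified sketch; the names and the
exact break conditions in the series differ):

static int adaptive_wait(struct rt_mutex *lock, struct task_struct *owner,
			 unsigned long timeout)
{
	while (timeout--) {
		if (rt_mutex_owner(lock) != owner)
			return 0;	/* owner changed: stop and retry    */
		if (current->state == TASK_RUNNING)
			return 0;	/* we were woken up: retry the lock */
		cpu_relax();
	}
	return 1;			/* timed out: really go to sleep    */
}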

It seems to be that the higher values help in the cases where the lock
is stolen from the pending owner, in which case the pending owner can
and will iterate and again start spinning.

We are certainly hoping that the community will test and also provide
feedback.


> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iteration similar to what
> udelay does. That would be at least constant time.
>
> -Andi

We (at least I :-) did try that, to the detriment of throughput. It
is seemingly overkill to more precisely account for the "max" time spent
spinning.

At one point, I did have a patch that specified the sysctl as
"nanoseconds" and converted that to an approximate loop count (via the
math in udelay()). The idea was that nanoseconds are more human
friendly, as well as hardware independent; however, on further review,
the idea was dropped.


Best,
-PWM

2008-02-21 17:27:00

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism


On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > + int "Default delay (in loops) for adaptive rtlocks"
> > + range 0 1000000000
> > + depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
> > + default "10000"
>
> Perhaps you can expand how you came up with that default number?

We did not want to create a hotspot around time sources for HRT and the
scheduler clock, and there really hasn't been enough analysis.

The loop should be calibrated using something similar to
loops_per_jiffy, and it should be in nanoseconds.

It needs to be tunable, because it depends a lot on the workload.

High frequency periodic tasks would need a lower setting - but it also
relates to the number of waiting tasks and the number of CPUs, so there
may be heuristics that tie into lockstat.

For big-SMP systems, it may actually be worth the overhead to track
these stats per-lock (for the hot locks), if we can correlate it all to
performance.

> It looks suspiciously round and worse the actual spin time depends a lot on the
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite
> differently from a 2Ghz CPU) Did you experiment with other spin times?
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted?
>

The code actually runs preemptibly, so even before the timeout expires,
the task can pop off the CPU (at which point another state change
cancels the loop).

> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iteration similar to what
> udelay does. That would be at least constant time.
>

True - I looked at something generic, similar to what RT's latency
tracing uses, allowing for other architectures.

Sven

> -Andi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2008-02-21 21:26:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks


hm. Why is the ticket spinlock patch included in this patchset? It just
skews your performance results unnecessarily. Ticket spinlocks are
independent conceptually, they are already upstream in 2.6.25-rc2 and
-rt will have them automatically once we rebase to .25.

and if we take the ticket spinlock patch out of your series, the
size of the patchset shrinks in half and touches only 200-300 lines of
code ;-) Considering the total size of the -rt patchset:

652 files changed, 23830 insertions(+), 4636 deletions(-)

we can regard it a routine optimization ;-)

regarding the concept: adaptive mutexes have been talked about in the
past, but their advantage is not at all clear, that's why we haven't done
them. It's definitely not an unambiguously win-win concept.

So let's get some real marketing-free benchmarking done, and we are not
just interested in the workloads where a bit of polling on contended
locks helps, but we are also interested in workloads where the polling
hurts ... And let's please do the comparisons without the ticket spinlock
patch ...

Ingo

2008-02-21 21:33:47

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

It would also help to get the lockdep/lockstat output for those runs
so that more discussion can happen. That framework can be extended to
do all sorts of contention tracking, and that is why I implemented it
in the first place: to track the viability of adaptive spins. My
initial results were that about 10 percent (heavier against BKL)
were suitable for adaptive spins and roughly 10 percent for ownership
stealing. The current implementation didn't seem to have those
measurements coded in just yet (unlike my private implementation).

I originally came to the conclusion that it wasn't worth it, but the
dbench numbers published say otherwise. It would be a good thing to
get both a precise chunk of instrumentation to show where the problems
are (these numbers could be skewed by a number of things) as well as
the specific locks that are contended against.

See ya

bill

On Thu, Feb 21, 2008 at 1:24 PM, Ingo Molnar <[email protected]> wrote:
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we havent done
> them. It's definitely not an unambigiously win-win concept.
>
> So lets get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And lets please do the comparisons without the ticket spinlock
> patch ...

2008-02-21 21:47:47

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

>>> On Thu, Feb 21, 2008 at 4:24 PM, in message <[email protected]>,
Ingo Molnar <[email protected]> wrote:

> hm. Why is the ticket spinlock patch included in this patchset? It just
> skews your performance results unnecessarily. Ticket spinlocks are
> independent conceptually, they are already upstream in 2.6.25-rc2 and
> -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous. I included them because we found the patch series without them can cause spikes due to the newly introduced pressure on the (raw_spinlock_t)lock->wait_lock. You can run the adaptive-spin patches without them just fine (in fact, in many cases things run faster without them....dbench *thrives* on chaos). But you may also measure a cyclic-test spike if you do so. So I included them to present a "complete package without spikes". I tried to explain that detail in the prologue, but most people probably fell asleep before they got to the end ;)

>
> and if we take the ticket spinlock patch out of your series, the the
> size of the patchset shrinks in half and touches only 200-300 lines of
> code ;-) Considering the total size of the -rt patchset:
>
> 652 files changed, 23830 insertions(+), 4636 deletions(-)
>
> we can regard it a routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

>
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we havent done
> them. It's definitely not an unambigiously win-win concept.
>
> So lets get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And lets please do the comparisons without the ticket spinlock
> patch ...

I'm open to suggestions, and this was just a sample of the testing we have done. We have thrown plenty of workloads at this patch series far beyond the slides I prepared in that URL, and they all seem to indicate a net positive improvement so far. Some of those results I cannot share due to NDA, and some I didn't share simply because I never formally collected the data like I did for these tests. If there is something you would like to see, please let me know and I will arrange for it to be executed if at all possible.

Regards,
-Greg

2008-02-21 22:03:22

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

>>> On Thu, Feb 21, 2008 at 4:42 PM, in message <[email protected]>,
Ingo Molnar <[email protected]> wrote:

> * Bill Huey (hui) <[email protected]> wrote:
>
>> I came to the original conclusion that it wasn't originally worth it,
>> but the dbench number published say otherwise. [...]
>
> dbench is a notoriously unreliable and meaningless workload. It's being
> frowned upon by the VM and the IO folks.

I agree...it's a pretty weak benchmark. BUT, it does pound on dcache_lock and therefore was a good demonstration of the benefits of lower contention overhead. Also note that we included other tests in that PDF if you scroll to the subsequent pages.

> If that's the only workload
> where spin-mutexes help, and if it's only a 3% improvement [of which it
> is unclear how much of that improvement was due to ticket spinlocks],
> then adaptive mutexes are probably not worth it.

Note that the "3%" figure being thrown around was from a single patch within the series. We are actually getting a net average gain of 443% in dbench. And note that the number goes *up* when you remove the ticketlocks. The ticketlocks are there to prevent latency spikes, not improve throughput.

Also take a look at the hackbench numbers, which are particularly promising. We get a net average gain of 493% for RT10-based hackbench runs. The kernel build showed only a small gain, but it was all gain nonetheless. We see similar results for any other workloads we throw at this thing. I will gladly run any requested test that I have the ability to run, and I would encourage third-party results as well.


>
> I'd not exclude them fundamentally though, it's really the numbers that
> matter. The code is certainly simple enough (albeit the .config and
> sysctl controls are quite ugly and unacceptable - adaptive mutexes
> should really be ... adaptive, with no magic constants in .configs or
> else).

We can clean this up, per your suggestions.

>
> But ... i'm somewhat sceptic, after having played with spin-a-bit
> mutexes before.

It's very subtle to get this concept to work. The first few weeks, we were getting 90% regressions ;) Then we had a breakthrough and started to get this thing humming along quite nicely.

Regards,
-Greg



2008-02-21 22:13:06

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

On Thu, 2008-02-21 at 22:24 +0100, Ingo Molnar wrote:
> hm. Why is the ticket spinlock patch included in this patchset? It just
> skews your performance results unnecessarily. Ticket spinlocks are
> independent conceptually, they are already upstream in 2.6.25-rc2 and
> -rt will have them automatically once we rebase to .25.
>

True, the ticket spinlock certainly adds to the throughput results we
have seen. However, the results without the ticket patch are still very
significant. (IIRC, 500-600MB/s instead of the ~730MB/s advertised) We
can easily re-gen the previous results for an apples-to-apples
comparison.

Without the ticket locks we see spikes in cyclictest that have been
mapped (via latency trace) to the non-deterministic behavior of the raw
rt lock "wait_lock" (which is used in the rt lock code). These spikes
disappear with the ticket lock patch installed.

Note that we see these spikes while running cyclictest concurrently with
dbench/hackbench/kernel builds.

> and if we take the ticket spinlock patch out of your series, the the
> size of the patchset shrinks in half and touches only 200-300 lines of
> code ;-) Considering the total size of the -rt patchset:
>
> 652 files changed, 23830 insertions(+), 4636 deletions(-)
>
> we can regard it a routine optimization ;-)
>
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we havent done
> them. It's definitely not an unambigiously win-win concept.
>

Can you elaborate? Where did you previously see that this would be
problematic?

> So lets get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And lets please do the comparisons without the ticket spinlock
> patch ...
>
> Ingo

Certainly. One of the things we have struggled with is real-world
loads that prove/disprove the implementation. Thus far, everything we
have come up with has only shown improvement.

The worst case we have come up with so far is running dbench/hackbench
in FIFO. Here we have multiple tasks at the same priority contending on
the same locks. This should give a worst case performance since the
tasks cannot preempt each other, and the nesting of locks should produce
significant latencies. However, with the addition of the lateral-steal
patch, we do not witness that at all; rather, the throughput is on par
with CFS results.

Higher priority FIFO tasks will preempt the polling tasks, and the lower
priority tasks will wait, as they should.

Are there some suggestions on workloads that would show how the polling
hurts?

Thanks,
-PWM

2008-02-21 22:42:25

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

On Thu, 2008-02-21 at 15:12 -0700, Peter W. Morreale wrote:

> True, the ticket spinlock certainly adds to the throughput results we
> have seen. However, the results without the ticket patch are still very
> significant. (IIRC, 500-600MB/s instead of the ~730MB/s advertised) We
> can easily re-gen the previous results for an apples-to-apples
> comparison.
>

I misspoke in the above. Greg is correct. Without the ticket lock
patch we see an *improvement* in throughput, not a decrease.

Sorry for any confusion.

Best,
-PWM

2008-02-21 22:54:31

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

On Thu, Feb 21, 2008 at 1:42 PM, Ingo Molnar <[email protected]> wrote:
> I'd not exclude them fundamentally though, it's really the numbers that
> matter. The code is certainly simple enough (albeit the .config and
> sysctl controls are quite ugly and unacceptable - adaptive mutexes
> should really be ... adaptive, with no magic constants in .configs or
> else).
>
> But ... i'm somewhat sceptic, after having played with spin-a-bit
> mutexes before.

Yeah, it's yet another reason to get better instrumentation in -rt for
this purpose and why it's important to only log things in the slow
path so that the front end of the rtmutex doesn't see instrumentation
artifacts in the measurement. If I get a chance tonight, I'll extend
lockdep/lockstat for this purpose if peterz doesn't beat me to it. I
won't have access to the test harness to replicate this though.

bill

2008-02-22 13:29:47

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep

Gregory Haskins wrote:
> @@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>
> debug_rt_mutex_print_deadlock(&waiter);
>
> - schedule_rt_mutex(lock);
> + update_current(TASK_UNINTERRUPTIBLE, &saved_state);

I have a question for everyone out there about this particular part of
the code. Patch 6/14 adds an optimization that is predicated on the
order in which we modify the state==TASK_UNINTERRUPTIBLE vs reading the
waiter.task below.

My assumption is that the xchg() (inside update_current()) acts as an
effective wmb(). If xchg() does not have this property, then this code
is broken and patch 6/14 should also add a:


+ smp_wmb();


> + if (waiter.task)
> + schedule_rt_mutex(lock);
> + else
> + update_current(TASK_RUNNING_MUTEX, &saved_state);
>
> spin_lock_irqsave(&lock->wait_lock, flags);
> current->flags |= saved_flags;
> current->lock_depth = saved_lock_depth;
> - state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
> - if (unlikely(state == TASK_RUNNING))
> - saved_state = TASK_RUNNING;


Does anyone know the answer to this?

Regards,
-Greg

2008-02-22 13:35:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep


On Fri, 22 Feb 2008, Gregory Haskins wrote:

> Gregory Haskins wrote:
> > @@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> >
> > debug_rt_mutex_print_deadlock(&waiter);
> >
> > - schedule_rt_mutex(lock);
> > + update_current(TASK_UNINTERRUPTIBLE, &saved_state);
>
> I have a question for everyone out there about this particular part of
> the code. Patch 6/14 adds an optimization that is predicated on the
> order in which we modify the state==TASK_UNINTERRUPTIBLE vs reading the
> waiter.task below.
>
> My assumption is that the xchg() (inside update_current()) acts as an
> effective wmb(). If xchg() does not have this property, then this code
> is broken and patch 6/14 should also add a:
>
>
> + smp_wmb();

I believe that the wmb would be needed. I doubt that xchg on all archs
would force any ordering of reads and writes. It only needs to guarantee the
atomic nature of the data exchange. I don't see any reason that it would
imply any type of memory barrier.

-- Steve


>
>
> > + if (waiter.task)
> > + schedule_rt_mutex(lock);
> > + else
> > + update_current(TASK_RUNNING_MUTEX, &saved_state);
> >
> > spin_lock_irqsave(&lock->wait_lock, flags);
> > current->flags |= saved_flags;
> > current->lock_depth = saved_lock_depth;
> > - state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
> > - if (unlikely(state == TASK_RUNNING))
> > - saved_state = TASK_RUNNING;
>
>
> Does anyone know the answer to this?
>
> Regards,
> -Greg
>

2008-02-22 13:36:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep


* Gregory Haskins <[email protected]> wrote:

> My assumption is that the xchg() (inside update_current()) acts as an
> effective wmb(). If xchg() does not have this property, then this
> code is broken and patch 6/14 should also add a:

xchg() is a strong implicit memory barrier, it implies smp_mb().
(historic sidenote: it was the very first SMP primitive we had in
Linux.)
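
Concretely, for the hunk quoted above, that means (schematic, not a literal
excerpt):

	/* waiter side */
	update_current(TASK_UNINTERRUPTIBLE, &saved_state);
					/* xchg() inside => full smp_mb()  */
	if (waiter.task)		/* this load cannot be reordered   */
		schedule_rt_mutex(lock);/* ahead of the state store        */
	else
		update_current(TASK_RUNNING_MUTEX, &saved_state);

	/*
	 * The releasing CPU clears waiter.task before issuing the wakeup,
	 * so the barrier implied by xchg() is what keeps this check from
	 * racing ahead of the state change.
	 */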

Ingo

2008-02-22 13:41:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep


On Fri, 2008-02-22 at 08:35 -0500, Steven Rostedt wrote:

> > My assumption is that the xchg() (inside update_current()) acts as an
> > effective wmb(). If xchg() does not have this property, then this code
> > is broken and patch 6/14 should also add a:
> >
> >
> > + smp_wmb();
>
> I believe that the wmb would be needed. I doubt that xchg on all archs
> would force any ordering of reads and writes. It only needs to guarantee the
> atomic nature of the data exchange. I don't see any reason that it would
> imply any type of memory barrier.

Documentation/memory-barriers.txt states:

Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
explicit lock operations, described later). These include:

xchg();


2008-02-22 13:43:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep


On Fri, 22 Feb 2008, Ingo Molnar wrote:

>
> * Gregory Haskins <[email protected]> wrote:
>
> > My assumption is that the xchg() (inside update_current()) acts as an
> > effective wmb(). If xchg() does not have this property, then this
> > code is broken and patch 6/14 should also add a:
>
> xchg() is a strong implicit memory barrier, it implies smp_mb().
> (historic sidenote: it was the very first SMP primitive we had in
> Linux.)

OK, I've been proven wrong ;-)

I was just thinking of how an arch would implement it. No need for memory
barriers in just an xchg. But if Linux "implies it" then that's another
story.

Thanks,

-- Steve

/me learns something new everyday.

2008-02-22 13:46:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep


Gregory,

I guess we should have just read Documentation/memory-barriers.txt

Here's the snippet:


Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
explicit lock operations, described later). These include:

xchg();
cmpxchg();
atomic_cmpxchg();
atomic_inc_return();
atomic_dec_return();
atomic_add_return();
atomic_sub_return();
atomic_inc_and_test();
atomic_dec_and_test();
atomic_sub_and_test();
atomic_add_negative();
atomic_add_unless();
test_and_set_bit();
test_and_clear_bit();
test_and_change_bit();


-- Steve

2008-02-22 19:08:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Thu, Feb 21, 2008 at 05:41:09PM +0100, Andi Kleen wrote:
>
> > +config RTLOCK_DELAY
> > + int "Default delay (in loops) for adaptive rtlocks"
> > + range 0 1000000000
> > + depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
> > + default "10000"
>
> Perhaps you can expand how you came up with that default number?
> It looks suspiciously round and worse the actual spin time depends a lot on the
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite
> differently from a 2Ghz CPU) Did you experiment with other spin times?
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted?
>
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iteration similar to what
> udelay does. That would be at least constant time.

One approach would be to set the RTLOCK_DELAY parameter to something like
-1 for default, and to set it to the number of cycles required for about
10 cache misses at boot time. This would automatically scale with CPU
frequency and memory latency.
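
A sketch of what that could look like (entirely hypothetical: the helper
below does not exist, and the current rtlock_timeout is expressed in loops,
not cycles):

#include <linux/init.h>
#include <asm/timex.h>		/* get_cycles() */

extern int rtlock_timeout;	/* defined in kernel/rtmutex.c */

static int __init rtlock_delay_calibrate(void)
{
	if (rtlock_timeout == -1) {		/* -1 == "auto-calibrate" */
		cycles_t t0 = get_cycles();

		touch_ten_cold_cachelines();	/* hypothetical helper */

		rtlock_timeout = (int)(get_cycles() - t0);
	}
	return 0;
}
late_initcall(rtlock_delay_calibrate);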

Thanx, Paul

2008-02-22 19:19:58

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Fri, Feb 22, 2008 at 11:08 AM, Paul E. McKenney
<[email protected]> wrote:
> One approach would be to set the RTLOCK_DELAY parameter to something like
> -1 for default, and to set it to the number of cycles required for about
> 10 cache misses at boot time. This would automatically scale with CPU
> frequency and memory latency.

Yeah, I'm not very keen on having a constant there without some
contention instrumentation to see how long the spins are. It would be
better to just let it run until either task->on_cpu is off or checking
if the "current" no longer matches the mutex owner for the runqueue
in question. At that point, you know the thread isn't running.
Spinning on something like that is just a waste of time. It's for that
reason that doing the spin outside of a preempt critical section
isn't really needed.

bill

2008-02-22 19:22:37

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[email protected]> wrote:
> Yeah, I'm not very keen on having a constant there without some
> contention instrumentation to see how long the spins are. It would be
> better to just let it run until either task->on_cpu is off or checking
> if the "current" in no longer matches the mutex owner for the runqueue
> in question. At that point, you know the thread isn't running.
> Spinning on something like that is just a waste of time. It's for that
> reason that doing in the spin outside of a preempt critical section
> isn't really needed

Excuse me, I meant to say "...isn't a problem".

bill

2008-02-22 19:23:00

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

Hi!

> Decorate the printk path with an "unlikely()"
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> kernel/rtmutex.c | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 122f143..ebdaa17 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
> void fastcall (*slowfn)(struct rt_mutex *lock))
> {
> /* Temporary HACK! */
> - if (!current->in_printk)
> - might_sleep();
> - else if (in_atomic() || irqs_disabled())
> + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
> /* don't grab locks for printk in atomic */
> return;
>
> + might_sleep();

I think you changed the code here... you call might_sleep() in
different cases afaict.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-22 19:23:35

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH [RT] 07/14] adaptive real-time lock support

Hi!

> @@ -192,6 +192,26 @@ config RCU_TRACE
> Say Y/M here if you want to enable RCU tracing in-kernel/module.
> Say N if you are unsure.
>
> +config ADAPTIVE_RTLOCK
> + bool "Adaptive real-time locks"
> + default y

tabs vs. spaces.

> + If unsure, say Y

Missing dot?

> @@ -7,6 +7,7 @@
> * Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[email protected]>
> * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
> * Copyright (C) 2006 Esben Nielsen
> + * Copyright (C) 2008 Novell, Inc.

It would be nice to list some names, too.

> --- /dev/null
> +++ b/kernel/rtmutex_adaptive.h
> @@ -0,0 +1,134 @@
> +/*
> + * Adaptive RT lock support
> + *
> + * There are pros and cons when deciding between the two basic forms of
> + * locking primitives (spinning vs sleeping). Without going into great
> + * detail on either one, we note that spinlocks have the advantage of
> + * lower overhead for short hold locks. However, they also have a
> + * con in that they create indeterminate latencies since preemption
> + * must traditionally be disabled while the lock is held (to prevent deadlock).
> + *
> + * We want to avoid non-deterministic critical sections in -rt. Therefore,
> + * when realtime is enabled, most contexts are converted to threads, and
> + * likewise most spinlock_ts are converted to sleepable rt-mutex derived
> + * locks. This allows the holder of the lock to remain fully preemptible,
> + * thus reducing a major source of latencies in the kernel.
> + *
> + * However, converting what was once a true spinlock into a sleeping lock
> + * may also decrease performance since the locks will now sleep under
> + * contention. Since the fundamental lock used to be a spinlock, it is
> + * highly likely that it was used in a short-hold path and that release
> + * is imminent. Therefore sleeping only serves to cause context-thrashing.
> + *
> + * Adaptive RT locks use a hybrid approach to solve the problem. They
> + * spin when possible, and sleep when necessary (to avoid deadlock, etc).
> + * This significantly improves many areas of the performance of the -rt
> + * kernel.
> + *
> + * Copyright (C) 2008 Novell, Inc.,
> + * Sven Dietrich, Peter Morreale, and Gregory Haskins

GPL?

> +/*
> + * Adaptive-rtlocks will busywait when possible, and sleep only if
> + * necessary. Note that the busyloop looks racy, and it is....but we do
> + * not care. If we lose any races it simply means that we spin one more
> + * time before seeing that we need to break-out on the next iteration.
> + *
> + * We realize this is a relatively large function to inline, but note that
> + * it is only instantiated 1 or 2 times max, and it makes a measurable
> + * performance different to avoid the call.
> + *
> + * Returns 1 if we should sleep
> + *
> + */

Kerneldoc?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-22 19:24:45

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH [RT] 02/14] spinlock: make preemptible-waiter feature a specific config option

Hi!

> We introduce a configuration variable for the feature to make it easier for
> various architectures and/or configs to enable or disable it based on their
> requirements.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> kernel/Kconfig.preempt | 9 +++++++++
> kernel/spinlock.c | 7 +++----
> lib/Kconfig.debug | 1 +
> 3 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index 41a0d88..5b45213 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -86,6 +86,15 @@ config PREEMPT
> default y
> depends on PREEMPT_DESKTOP || PREEMPT_RT
>
> +config DISABLE_PREEMPT_SPINLOCK_WAITERS
> + bool
> + default n
> +
> +config PREEMPT_SPINLOCK_WAITERS
> + bool
> + default y
> + depends on PREEMPT && SMP && !DISABLE_PREEMPT_SPINLOCK_WAITERS

tabs vs. spaces problems?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-22 19:44:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[email protected]> wrote:
> > Yeah, I'm not very keen on having a constant there without some
> > contention instrumentation to see how long the spins are. It would be
> > better to just let it run until either task->on_cpu is off or checking
> > if the "current" in no longer matches the mutex owner for the runqueue
> > in question. At that point, you know the thread isn't running.
> > Spinning on something like that is just a waste of time. It's for that
> > reason that doing in the spin outside of a preempt critical section
> > isn't really needed
>
> Excuse me, I meant to say "...isn't a problem".

The fixed-time spins are very useful in cases where the critical section
is almost always very short but can sometimes be very long. In such
cases, you would want to spin until either ownership changes or it is
apparent that the current critical-section instance will be long.

I believe that there are locks in the Linux kernel that have this
"mostly short but sometimes long" hold-time property.

Thanx, Paul

2008-02-22 19:56:16

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism


On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[email protected]> wrote:
> > > Yeah, I'm not very keen on having a constant there without some
> > > contention instrumentation to see how long the spins are. It would be
> > > better to just let it run until either task->on_cpu is off or checking
> > > if the "current" in no longer matches the mutex owner for the runqueue
> > > in question. At that point, you know the thread isn't running.
> > > Spinning on something like that is just a waste of time. It's for that
> > > reason that doing in the spin outside of a preempt critical section
> > > isn't really needed
> >
> > Excuse me, I meant to say "...isn't a problem".
>
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long. In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.
>
> I believe that there are locks in the Linux kernel that have this
> "mostly short but sometimes long" hold-time property.

In regards to this "mostly short but sometimes long" question: for
very large SMP systems, running with some profiling enabled might
allow the system to adapt to varying workloads and therefore to
shifting lock contention / hold-times.

Overall utilization despite the overhead might be lower, but this is
tbd.

In high-contention, short-hold time situations, it may even make sense
to have multiple CPUs with multiple waiters spinning, depending on
hold-time vs. time to put a waiter to sleep and wake them up.

The wake-up side could also walk ahead on the queue, and bring up
spinners from sleeping, so that they are all ready to go when the lock
flips green for them.

But in more simple cases, there should be a simple, default timeout
governed by context switch overhead or as defined by a derived number of
cache misses, as you suggested.

Sven

> Thanx, Paul
> -
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-02-22 20:15:55

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
>
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long. In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.
>

This is what the adaptive patch set does. All current spinners exit the
loop on an owner change.

In this case each "spinner" will traverse the outer rt lock loop and
re-attempt to acquire the lock. If unsuccessful, the task will re-enter
the adaptive wait loop again, providing that there are still "timeout"
iterations left for that particular task.

I do have some stats on the rt lock/adaptive algorithm from a crude
patch used during development. As Bill has noted, the correct way to
obtain these would be to extend lockstats. It was just more expedient
to create this crude patch at the time. If it is helpful to look at
these results in the meantime, I can easily make them, as well as the
patch, available.


-PWM

2008-02-22 20:23:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Fri, Feb 22, 2008 at 11:55:45AM -0800, Sven-Thorsten Dietrich wrote:
>
> On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[email protected]> wrote:
> > > > Yeah, I'm not very keen on having a constant there without some
> > > > contention instrumentation to see how long the spins are. It would be
> > > > better to just let it run until either task->on_cpu is off or checking
> > > > if the "current" in no longer matches the mutex owner for the runqueue
> > > > in question. At that point, you know the thread isn't running.
> > > > Spinning on something like that is just a waste of time. It's for that
> > > > reason that doing in the spin outside of a preempt critical section
> > > > isn't really needed
> > >
> > > Excuse me, I meant to say "...isn't a problem".
> >
> > The fixed-time spins are very useful in cases where the critical section
> > is almost always very short but can sometimes be very long. In such
> > cases, you would want to spin until either ownership changes or it is
> > apparent that the current critical-section instance will be long.
> >
> > I believe that there are locks in the Linux kernel that have this
> > "mostly short but sometimes long" hold-time property.
>
> In regards to this "mostly short but sometimes long" question,
> for very large SMP systems, running with some profiling enabled, might
> allow the system to adapt to varying workloads and therefore shifting
> lock contention / hold-times.
>
> Overall utilization despite the overhead might be lower, but this is
> tbd.
>
> In high-contention, short-hold time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. time to put a waiter to sleep and wake them up.
>
> The wake-up side could also walk ahead on the queue, and bring up
> spinners from sleeping, so that they are all ready to go when the lock
> flips green for them.
>
> But in more simple cases, there should be a simple, default timeout
> governed by context switch overhead or as defined by a derived number of
> cache misses, as you suggested.

Governing the timeout by context-switch overhead sounds even better to me.
Really easy to calibrate, and short critical sections are of much shorter
duration than a context-switch pair.

Thanx, Paul

> Sven
>
> > Thanx, Paul
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2008-02-22 20:36:27

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
>
> In high-contention, short-hold time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. time to put a waiter to sleep and wake them up.
>
> The wake-up side could also walk ahead on the queue, and bring up
> spinners from sleeping, so that they are all ready to go when the lock
> flips green for them.
>

I did attempt this at one time. The logic was merely: if the
pending owner was running, wake up the next waiter. The results were
terrible for the benchmarks used, as compared to the current
implementation.

What this meant was that virtually every unlock performed a wakeup, if
not for the new pending owner, then for the next-in-line waiter.

My impression at the time was that the contention for the rq lock is
significant, even if the task being woken up was already running.

I can generate numbers if that helps.

-PWM


2008-02-22 22:04:13

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

Paul E. McKenney wrote:
> Governing the timeout by context-switch overhead sounds even better to me.
> Really easy to calibrate, and short critical sections are of much shorter
> duration than are a context-switch pair.

Yeah, fully agree. This is on my research "todo" list. My theory is
that the ultimate adaptive-timeout algorithm here would essentially be
the following:

*) compute the context-switch pair time average for the system. This is
your time threshold (CSt).

*) For each lock, maintain an average hold-time (AHt) statistic (I am
assuming this can be done cheaply...perhaps not).

The adaptive code would work as follows:

if (AHt > CSt)  /* don't even bother if the average is greater than CSt */
        timeout = 0;
else
        timeout = AHt;

if (adaptive_wait(timeout))
        sleep();

Anyone have some good ideas on how to compute CSt? I was thinking you
could create two kthreads that message one another (measuring round-trip
time) for some number of iterations (say 100) to get an average. You could
probably just approximate it with flushing workqueue jobs.
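
A crude sketch of the workqueue approximation (my reading of the idea; a
real measurement would want warm-up runs and outlier rejection, and the
result includes queueing overhead, not just the switch pair):

#include <linux/workqueue.h>
#include <asm/timex.h>
#include <asm/div64.h>

static void cst_nop(struct work_struct *work)
{
}

static cycles_t estimate_cst(int iters)
{
	static DECLARE_WORK(w, cst_nop);
	cycles_t total = 0;
	int i;

	for (i = 0; i < iters; i++) {
		cycles_t t0 = get_cycles();

		schedule_work(&w);	/* hand off to keventd ...        */
		flush_scheduled_work();	/* ... and wait for it to run     */
		total += get_cycles() - t0;
	}
	do_div(total, iters);		/* rough context-switch-pair cost */
	return total;
}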

-Greg

>
> Thanx, Paul
>
>> Sven
>>
>>> Thanx, Paul
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-02-22 22:25:34

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

Pavel Machek wrote:
> Hi!
>
>> Decorate the printk path with an "unlikely()"
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> kernel/rtmutex.c | 8 ++++----
>> 1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index 122f143..ebdaa17 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
>> void fastcall (*slowfn)(struct rt_mutex *lock))
>> {
>> /* Temporary HACK! */
>> - if (!current->in_printk)
>> - might_sleep();
>> - else if (in_atomic() || irqs_disabled())
>> + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
>> /* don't grab locks for printk in atomic */
>> return;
>>
>> + might_sleep();
>
> I think you changed the code here... you call might_sleep() in
> different cases afaict.

Agreed, but it's still correct afaict. I added an extra might_sleep()
to a path that really might sleep. I should have mentioned that in the
header.

In any case, it's moot. Andi indicated this patch is probably a no-op, so
I was considering dropping it on the v2 pass.

Regards,
-Greg


2008-02-23 00:54:19

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

On Fri, Feb 22, 2008 at 2:20 PM, Gregory Haskins
<[email protected]> wrote:
> Agreed, but it's still correct afaict. I added an extra might_sleep()
> to a path that really might sleep. I should have mentioned that in the
> header.
>
> In any case, its moot. Andi indicated this patch is probably a no-op so
> I was considering dropping it on the v2 pass.

The might_sleep is an annotation as well as a conditional preemption
point for the regular kernel. You might want to do a schedule check
there, but it's the wrong function if memory serves me correctly. It's
reserved for things that actually are designed to sleep. The rt_spin*()
functions are really a method of preserving BKL semantics across real
schedule() calls. You'd have to use something else for that
purpose, like cond_reschedule() instead.

bill

2008-02-23 07:37:25

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism


On Fri, 2008-02-22 at 13:36 -0700, Peter W. Morreale wrote:
> On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
> >
> > In high-contention, short-hold time situations, it may even make sense
> > to have multiple CPUs with multiple waiters spinning, depending on
> > hold-time vs. time to put a waiter to sleep and wake them up.
> >
> > The wake-up side could also walk ahead on the queue, and bring up
> > spinners from sleeping, so that they are all ready to go when the lock
> > flips green for them.
> >
>
> I did try this one time. The logic was merely: if the pending owner
> was running, wake up the next waiter. The results were terrible for
> the benchmarks used, as compared to the current implementation.

Yup, but you cut the CONTEXT where I said:

"for very large SMP systems"

Specifically, what I mean is an SMP system where I have enough CPUs to
do this:

let (t_Tcs) be the time to lock, transition and unlock an un-contended
critical section (i.e. the one that I am the pending waiter for).

let (t_W) be the time to wake up a sleeping task.

and let (t_W > t_Tcs)

Then, "for very large SMP systems"

if

S = (t_W / t_Tcs),

then S is the number of tasks that can transition the critical section
before the first sleeper would wake up,

and assume the number of CPUs > S.

The time (T_N_cs) for an arbitrary number of tasks N > S, all competing
for lock L, to transition the critical section approaches:

T_N_cs = (N * t_W)

if you have only 1 task spinning.

but if you can have

N tasks spinning, (T_N_cs) approaches:

T_N_cs = (N * t_Tcs)

and with the premise, that t_W > t_Tcs, you should see a dramatic
throughput improvement when running PREEMPT_RT on VERY LARGE SMP
systems.

I should note that the math above is very much simplified, but I hope
it's sufficient to demonstrate the concept.
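
To put rough, purely made-up numbers on it: say t_W = 10us and
t_Tcs = 1us, so S = 10. With N = 64 waiters and only a single spinner,
T_N_cs approaches 64 * 10us = 640us; on a box with enough CPUs for all
64 to spin, it approaches 64 * 1us = 64us. The gain tracks
(t_W / t_Tcs), which is why this only pays off once the number of CPUs
exceeds S.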

I have to acknowledge Ingo's comments, that this is all suspect until
proven to make a positive difference in "non-marketing" workloads.

I personally *think* we are past that already, and the adaptive concept
can and will be extended and scaled as M-socket and N-core based SMP
proliferates into larger grid-based systems. But there is plenty more
to do to prove it.

(someone send me a 1024 CPU box and a wind-powered-generator)

Sven



>
> What this meant was that virtually every unlock performed a wakeup, if
> not for the new pending owner, then for the next-in-line waiter.
>
> My impression at the time was that the contention for the rq lock is
> significant, regardless of whether the task being woken up was already
> running.
>
> I can generate numbers if that helps.
>
> -PWM
>
>
>

2008-02-23 08:08:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

On Thu, 21 Feb 2008 22:24:20 +0100 Ingo Molnar <[email protected]> wrote:

> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we haven't done
> them. It's definitely not an unambiguously win-win concept.

When ext3 was converted from sleeping locks to spinlocks, dbench-on-numaq
throughput went up by a factor of ten. I'd expect that what RT has done
was a truly awful change for lots of workloads on lots of machines.

Yeah, there's the don't-enable-it-if-you're-doing-that option, but that adds
the we-just-doubled-the-number-of-kernels-distros-need-to-ship problem.

Does -rt also make bit_spin_lock preemptible? If not, I'd have thought the
change was of little benefit to ext3/jbd and it might as well go back to
spinning locks.

2008-02-23 12:31:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

> *) compute the context-switch pair time average for the system. This is
> your time threshold (CSt).

This is not a uniform time. Consider the difference between
context switch on the same hyperthread, context switch between cores
on a die, context switch between sockets, context switch between
distant numa nodes. You could have several orders of magnitude
between all those.

>
> *) For each lock, maintain an average hold-time (AHt) statistic (I am
> assuming this can be done cheaply...perhaps not).

That would assume that the hold times are very uniform. But what happens
when you e.g. have a workload where 50% of the lock acquisitions are short
and 30% are long?

I'm a little sceptical of such "too clever" algorithms.

-Andi

2008-02-23 16:32:31

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

On Sat, Feb 23, 2008 at 01:31:00PM +0100, Andi Kleen wrote:
> > *) compute the context-switch pair time average for the system. This is
> > your time threshold (CSt).
>
> This is not a uniform time. Consider the difference between
> context switch on the same hyperthread, context switch between cores
> on a die, context switch between sockets, context switch between
> distant numa nodes. You could have several orders of magnitude
> between all those.

Wouldn't the common case for blocking on a lock with a short hold
time be the minimal context-switch pair on the same CPU?

> > *) For each lock, maintain an average hold-time (AHt) statistic (I am
> > assuming this can be done cheaply...perhaps not).
>
> That would assume that the hold times are very uniform. But what happens
> when you e.g. have a workload where 50% of the lock acquisitions are short
> and 30% are long?
>
> I'm a little sceptical of such "too clever" algorithms.

Me too. That said, I cannot resist pointing out that the measurement
of interest would be the fraction of lock-hold times that are shorter
than the context-switch time. ;-)
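
In case it helps make that concrete, per-lock bookkeeping for exactly
that fraction could be as small as the following sketch (the names are
invented here and nothing like this is in the posted patches):

	struct adaptive_lock_stats {
		unsigned long total;	/* hold-time samples seen */
		unsigned long shorter;	/* samples shorter than a ctx-switch pair */
	};

	/*
	 * Record one observed hold time; cs_pair_ns would be the measured
	 * context-switch pair cost, however the kernel chooses to obtain it.
	 */
	static inline void lock_stats_update(struct adaptive_lock_stats *s,
					     u64 hold_ns, u64 cs_pair_ns)
	{
		s->total++;
		if (hold_ns < cs_pair_ns)
			s->shorter++;
	}

	/* Spin only while most recorded holds beat a sleep/wakeup cycle. */
	static inline int lock_stats_should_spin(struct adaptive_lock_stats *s)
	{
		return !s->total || 2 * s->shorter > s->total;
	}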

Thanx, Paul

2008-02-25 05:21:01

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

Bill Huey (hui) wrote:
>
> might_sleep() is an annotation as well as a conditional preemption
> point for the regular kernel. You might want to do a schedule check
> there, but it's the wrong function if memory serves me correctly. It's
> reserved for things that actually are designed to sleep.

Note that might_sleep() already does a cond_resched() on the
configurations that need it, so I am not sure what you are getting at
here. Is that not enough?
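
For reference, mainline defines it roughly like this (paraphrased from
include/linux/kernel.h, so take the exact #ifdefs as approximate):

	#ifdef CONFIG_PREEMPT_VOLUNTARY
	# define might_resched()	cond_resched()
	#else
	# define might_resched()	do { } while (0)
	#endif

	#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
	# define might_sleep() \
		do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
	#else
	# define might_sleep()	do { might_resched(); } while (0)
	#endif

so the voluntary-preempt scheduling point comes along for free wherever
the annotation is placed.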


> The rt_spin*()
> functions are really a method of preserving BKL semantics across real
> schedule() calls. You'd have to use something else for that
> purpose, like cond_reschedule() instead.

I don't quite understand this part either. From my perspective,
rt_spin*() functions are locking constructs that might sleep (or might
spin with the new patches), and they happen to be BKL and wakeup
transparent. To me, either the might_sleep() is correct for all paths
that don't fit the in_atomic-printk exception, or none of them are.

Are you saying that the modified logic that I introduced is broken? Or
that the original use of the might_sleep() annotation inside this
function is broken?

-Greg

2008-02-25 06:21:20

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

[repost with all folks CCed]

On Sun, Feb 24, 2008 at 9:20 PM, Gregory Haskins
<[email protected]> wrote:
> Are you saying that the modified logic that I introduced is broken? Or
> that the original use of the might_sleep() annotation inside this
> function is broken?

It's probably safe to use, but that's not its original purpose, and you
should use another function/macro. This is an annotation issue, and your
use of it is inconsistent with how it's used in voluntary preempt. I
mentioned it before in a previous post. Folks will correct me if I'm
wrong, but you should use another macro or function.

bill

2008-02-25 09:02:40

by Bill Huey (hui)

[permalink] [raw]
Subject: Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

On Sun, Feb 24, 2008 at 10:21 PM, Bill Huey (hui) <[email protected]> wrote:
> It's probably safe to use, but that's not its original purpose, and you
> should use another function/macro. This is an annotation issue, and your
> use of it is inconsistent with how it's used in voluntary preempt. I
> mentioned it before in a previous post. Folks will correct me if I'm
> wrong, but you should use another macro or function.

Actually, forget what I said, it's ok to use it.

bill

2008-02-25 23:58:15

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism


On Sat, 2008-02-23 at 13:31 +0100, Andi Kleen wrote:
> > *) compute the context-switch pair time average for the system. This is
> > your time threshold (CSt).
>

Hi Andi,

> This is not a uniform time. Consider the difference between
> context switch on the same hyperthread, context switch between cores
> on a die, context switch between sockets, context switch between
> distant numa nodes. You could have several orders of magnitude
> between all those.
>
> >
> > *) For each lock, maintain an average hold-time (AHt) statistic (I am
> > assuming this can be done cheaply...perhaps not).
>
> That would assume that the hold times are very uniform. But what happens
> when you e.g. have a workload where 50% of the lock acquisitions are short
> and 30% are long?
>
> I'm a little sceptical of such "too clever" algorithms.
>

I agree, this does not make any sense until you get to very large SMP
systems - and only if it can be demonstrated that the overhead
associated with lock statistics, plus some notion of the context-switch
overhead, still comes out as a net gain.

There is some notion of task-routing in the RT scheduler already, and
this is quite a clever little algorithm.

I see no reason why the scheduler should not eventually take directly
into account (when balancing) the quantifiable context-switch and CPU
overhead of moving a task to a distant processor.

In fact, for a deadline-based scheduler working on high-frequency tasks,
given that these times can vary so radically, this would be required.

Once context-switch-overhead data is available to the scheduler, there
is no particular reason why adaptive locking could not also utilize
that data.

Sven

> -Andi