this is version -V7 of the generic mutex subsystem. It consists of the
following 11 patches:
add-atomic-xchg.patch
mutex-generic-asm-implementations.patch
mutex-asm-mutex.h-i386.patch
mutex-asm-mutex.h-x86_64.patch
mutex-asm-mutex.h-arm.patch
mutex-arch-mutex-h.patch
mutex-core.patch
mutex-docs.patch
mutex-debug.patch
mutex-debug-more.patch
xfs-mutex-namespace-collision-fix.patch
the patches are against Linus' latest GIT tree, and they should work
fine on every Linux architecture.
Changes since -V6:
33 files changed, 515 insertions(+), 360 deletions(-)
- added asm-arm/mutex.h ARM mutex fastpath implementation,
by Nicolas Pitre.
- as per Linus' suggestion, split up the mutex debugging code and
prototypes into 4 separate files: kernel/mutex.c, kernel/mutex.h,
kernel/mutex-debug.c and kernel/mutex-debug.h, and made the debugging
code build as kernel/mutex-debug.o. As a result mutex.c got smaller
and easier to read. This also eliminated an #ifdef.
- added a new "NULL mutex fastpath implementation" via
asm-generic/mutex-null.h, which is now used by the debugging code.
This got rid of two ugly #ifdef's in mutex.c, and removed code as
well (a rough sketch of what this header amounts to follows below).
- comment cleanups by Nicolas Pitre.
- more doc/comment additions and updates.
- lots of code and style cleanups in various mutex headers.
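For reference, here is a rough sketch of what such a null-fastpath header
amounts to (an assumed shape, not necessarily the literal -V7 file): every
fastpath helper simply invokes the slowpath function it is handed, so the
debug build always goes through the fully checked slow path.

#define __mutex_fastpath_lock(count, fn)		fn(count)
#define __mutex_fastpath_lock_retval(count, fn)		fn(count)
#define __mutex_fastpath_unlock(count, fn)		fn(count)
#define __mutex_slowpath_needs_to_unlock()		1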
Ingo
On Fri, 23 Dec 2005, Ingo Molnar wrote:
> this is version -V7 of the generic mutex subsystem. It consists of the
> following 11 patches:
Here's another one, allowing architecture specific trylock
implementations. New generic xchg-based and ARMv6 implementations are
provided, while everything else remains the same.
Signed-off-by: Nicolas Pitre <[email protected]>
Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -69,4 +69,38 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 0
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ int prev = atomic_xchg(count, 0);
+
+ if (unlikely(prev < 0)) {
+ /*
+ * The lock was marked contended so we must restore that
+ * state. If while doing so we get back a prev value of 1
+ * then we just own it.
+ *
+ * In all cases this has the potential to trigger the
+ * slowpath for the owner's unlock path - but this is not
+ * a big problem in practice.
+ */
+ prev = atomic_xchg(count, -1);
+ if (prev < 0)
+ prev = 0;
+ }
+
+ return prev;
+}
+
#endif
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-i386/mutex.h
+++ linux-2.6/include/asm-i386/mutex.h
@@ -99,4 +99,38 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ /*
+ * We have two variants here. The cmpxchg based one is the best one
+ * because it never induces a false contention state.
+ * If not available we fall back to the atomic_dec_return variant.
+ *
+ * The atomic_dec_return variant might end up making the counter
+ * negative in the failure case, which may trigger the slowpath
+ * for the owner's unlock path - but this is not a big problem
+ * in practice.
+ */
+#ifdef __HAVE_ARCH_CMPXCHG
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+#else
+ if (likely(atomic_dec_return(count) == 0))
+ return 1;
+#endif
+ return 0;
+}
+
#endif
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/mutex.h
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -71,4 +71,21 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
+ * if it wasn't 1 originally.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+ else
+ return 0;
+}
+
#endif
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -219,19 +219,6 @@ __mutex_lock_interruptible_nonatomic(str
return -EINTR;
}
-/*
- * We have three mutex_trylock() variants. The cmpxchg based one is
- * the best one (because it has no side-effect on mutex_unlock()),
- * but cmpxchg is not available on every architecture, so we also
- * provide an atomic_dec_return based variant too. The debug variant
- * takes the internal lock.
- *
- * [ The atomic_dec_return variant might end up making the counter
- * negative in the failure case, which may trigger the slowpath
- * for the owner's unlock path - but this is not a big problem
- * in practice. ]
- */
-#ifndef CONFIG_DEBUG_MUTEXES
/***
* mutex_trylock - try acquire the mutex, without waiting
* @lock: the mutex to be acquired
@@ -248,24 +235,13 @@ __mutex_lock_interruptible_nonatomic(str
*/
int fastcall mutex_trylock(struct mutex *lock)
{
-#ifdef __HAVE_ARCH_CMPXCHG
- if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
- return 1;
+#ifndef CONFIG_DEBUG_MUTEXES
+ return __mutex_trylock(&lock->count);
#else
- if (atomic_dec_return(&lock->count) == 0)
- return 1;
-#endif
- return 0;
-}
-
-#else /* CONFIG_DEBUG_MUTEXES: */
-
-/*
- * In the debug case we take the spinlock and check whether we can
- * get the lock:
- */
-int fastcall mutex_trylock(struct mutex *lock)
-{
+ /*
+ * In the debug case we take the spinlock and check whether we can
+ * get the lock:
+ */
struct thread_info *ti = current_thread_info();
int ret = 0;
@@ -280,10 +256,9 @@ int fastcall mutex_trylock(struct mutex
spin_unlock_mutex(&lock->wait_lock);
return ret;
+#endif
}
-#endif /* CONFIG_DEBUG_MUTEXES */
-
/*
* Release the lock, slowpath:
*/
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -98,5 +98,31 @@ do { \
*/
#define __mutex_slowpath_needs_to_unlock() 1
+/*
+ * For __mutex_trylock we use another construct which could be described
+ * as an "incomplete atomic decrement" or a "single value cmpxchg" since
+ * it has two modes of failure:
+ *
+ * 1) if the exclusive store fails we fail, and
+ *
+ * 2) if the decremented value is not zero we don't even attempt the store.
+ *
+ * This provides the needed trylock semantics like cmpxchg would, but it is
+ * lighter and less generic than a true cmpxchg implementation.
+ */
+static inline int __mutex_trylock(atomic_t *count)
+{
+ int __ex_flag, __res;
+ __asm__ (
+ "ldrex %0, [%2]\n\t"
+ "subs %0, %0, #1\n\t"
+ "strexeq %1, %0, [%2]"
+ : "=&r" (__res), "=&r" (__ex_flag)
+ : "r" (&count->counter)
+ : "cc","memory" );
+ __res |= __ex_flag;
+ return __res == 0;
+}
+
#endif
#endif
On Sat, 24 Dec 2005, Nicolas Pitre wrote:
> Here's another one, allowing architecture specific trylock
> implementations. New generic xchg-based and ARMv6 implementations are
> provided, while everything else remains the same.
Sorry, the previous patch was missing one file (mutex-dec.h).
Signed-off-by: Nicolas Pitre <[email protected]>
Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -69,4 +69,38 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 0
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ int prev = atomic_xchg(count, 0);
+
+ if (unlikely(prev < 0)) {
+ /*
+ * The lock was marked contended so we must restore that
+ * state. If while doing so we get back a prev value of 1
+ * then we just own it.
+ *
+ * In all cases this has the potential to trigger the
+ * slowpath for the owner's unlock path - but this is not
+ * a big problem in practice.
+ */
+ prev = atomic_xchg(count, -1);
+ if (prev < 0)
+ prev = 0;
+ }
+
+ return prev;
+}
+
#endif
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-i386/mutex.h
+++ linux-2.6/include/asm-i386/mutex.h
@@ -99,4 +99,38 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ /*
+ * We have two variants here. The cmpxchg based one is the best one
+ * because it never induces a false contention state.
+ * If not available we fall back to the atomic_dec_return variant.
+ *
+ * The atomic_dec_return variant might end up making the counter
+ * negative in the failure case, which may trigger the slowpath
+ * for the owner's unlock path - but this is not a big problem
+ * in practice.
+ */
+#ifdef __HAVE_ARCH_CMPXCHG
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+#else
+ if (likely(atomic_dec_return(count) == 0))
+ return 1;
+#endif
+ return 0;
+}
+
#endif
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/mutex.h
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -71,4 +71,21 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
+ * if it wasn't 1 originally.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+ else
+ return 0;
+}
+
#endif
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -219,19 +219,6 @@ __mutex_lock_interruptible_nonatomic(str
return -EINTR;
}
-/*
- * We have three mutex_trylock() variants. The cmpxchg based one is
- * the best one (because it has no side-effect on mutex_unlock()),
- * but cmpxchg is not available on every architecture, so we also
- * provide an atomic_dec_return based variant too. The debug variant
- * takes the internal lock.
- *
- * [ The atomic_dec_return variant might end up making the counter
- * negative in the failure case, which may trigger the slowpath
- * for the owner's unlock path - but this is not a big problem
- * in practice. ]
- */
-#ifndef CONFIG_DEBUG_MUTEXES
/***
* mutex_trylock - try acquire the mutex, without waiting
* @lock: the mutex to be acquired
@@ -248,24 +235,13 @@ __mutex_lock_interruptible_nonatomic(str
*/
int fastcall mutex_trylock(struct mutex *lock)
{
-#ifdef __HAVE_ARCH_CMPXCHG
- if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
- return 1;
+#ifndef CONFIG_DEBUG_MUTEXES
+ return __mutex_trylock(&lock->count);
#else
- if (atomic_dec_return(&lock->count) == 0)
- return 1;
-#endif
- return 0;
-}
-
-#else /* CONFIG_DEBUG_MUTEXES: */
-
-/*
- * In the debug case we take the spinlock and check whether we can
- * get the lock:
- */
-int fastcall mutex_trylock(struct mutex *lock)
-{
+ /*
+ * In the debug case we take the spinlock and check whether we can
+ * get the lock:
+ */
struct thread_info *ti = current_thread_info();
int ret = 0;
@@ -280,10 +256,9 @@ int fastcall mutex_trylock(struct mutex
spin_unlock_mutex(&lock->wait_lock);
return ret;
+#endif
}
-#endif /* CONFIG_DEBUG_MUTEXES */
-
/*
* Release the lock, slowpath:
*/
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -98,5 +98,31 @@ do { \
*/
#define __mutex_slowpath_needs_to_unlock() 1
+/*
+ * For __mutex_trylock we use another construct which could be described
+ * as an "incomplete atomic decrement" or a "single value cmpxchg" since
+ * it has two modes of failure:
+ *
+ * 1) if the exclusive store fails we fail, and
+ *
+ * 2) if the decremented value is not zero we don't even attempt the store.
+ *
+ * This provides the needed trylock semantics like cmpxchg would, but it is
+ * lighter and less generic than a true cmpxchg implementation.
+ */
+static inline int __mutex_trylock(atomic_t *count)
+{
+ int __ex_flag, __res;
+ __asm__ (
+ "ldrex %0, [%2]\n\t"
+ "subs %0, %0, #1\n\t"
+ "strexeq %1, %0, [%2]"
+ : "=&r" (__res), "=&r" (__ex_flag)
+ : "r" (&count->counter)
+ : "cc","memory" );
+ __res |= __ex_flag;
+ return __res == 0;
+}
+
#endif
#endif
Index: linux-2.6/include/asm-generic/mutex-dec.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-dec.h
+++ linux-2.6/include/asm-generic/mutex-dec.h
@@ -63,4 +63,40 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ /*
+ * We have two variants here. The cmpxchg based one is the best one
+ * because it never induces a false contention state. It is included
+ * here because architectures using the inc/dec algorithms over the
+ * xchg ones are much more likely to support cmpxchg natively.
+ * If not we fall back to the atomic_dec_return variant.
+ *
+ * The atomic_dec_return variant might end up making the counter
+ * negative in the failure case, which may trigger the slowpath
+ * for the owner's unlock path - but this is not a big problem
+ * in practice.
+ */
+#ifdef __HAVE_ARCH_CMPXCHG
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+#else
+ if (likely(atomic_dec_return(count) == 0))
+ return 1;
+#endif
+ return 0;
+}
+
#endif
Ingo,
Coming next are 3 patches to your mutex code that I'd like you to
consider. They express my last requests about mutexes.
The first patch is a resend of my trylock rework to allow for a pure
xchg-based implementation.
The second one gives architectures the ability to inline the lock/unlock
fastpaths. As explained in another mail, this is almost always
beneficial on ARM.
And the last patch makes mutex_is_locked() always inlined since I can't
find a reason why it wouldn't be beneficial to do so, even on x86.
I hope we can agree on them so that I can go back to hiding behind the
ARM-specific parts of the kernel again. :-)
Nicolas
In the spirit of uniformity, this patch provides architecture specific
trylock implementations. This allows for a pure xchg-based implementation,
as well as an optimized ARMv6 implementation.
On i386 and x86_64 things remain the same.
Signed-off-by: Nicolas Pitre <[email protected]>
Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -69,4 +69,38 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 0
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ int prev = atomic_xchg(count, 0);
+
+ if (unlikely(prev < 0)) {
+ /*
+ * The lock was marked contended so we must restore that
+ * state. If while doing so we get back a prev value of 1
+ * then we just own it.
+ *
+ * In all cases this has the potential to trigger the
+ * slowpath for the owner's unlock path - but this is not
+ * a big problem in practice.
+ */
+ prev = atomic_xchg(count, -1);
+ if (prev < 0)
+ prev = 0;
+ }
+
+ return prev;
+}
+
#endif
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-i386/mutex.h
+++ linux-2.6/include/asm-i386/mutex.h
@@ -99,4 +99,38 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ /*
+ * We have two variants here. The cmpxchg based one is the best one
+ * because it never induce a false contention state.
+ * If not available we fall back to the atomic_dec_return variant.
+ *
+ * The atomic_dec_return variant might end up making the counter
+ * negative in the failure case, which may trigger the slowpath
+ * for the owner's unlock path - but this is not a big problem
+ * in practice.
+ */
+#ifdef __HAVE_ARCH_CMPXCHG
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+#else
+ if (likely(atomic_dec_return(count) == 0))
+ return 1;
+#endif
+ return 0;
+}
+
#endif
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/mutex.h
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -71,4 +71,21 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
+ * if it wasn't 1 originally.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+ else
+ return 0;
+}
+
#endif
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -219,19 +219,6 @@ __mutex_lock_interruptible_nonatomic(str
return -EINTR;
}
-/*
- * We have three mutex_trylock() variants. The cmpxchg based one is
- * the best one (because it has no side-effect on mutex_unlock()),
- * but cmpxchg is not available on every architecture, so we also
- * provide an atomic_dec_return based variant too. The debug variant
- * takes the internal lock.
- *
- * [ The atomic_dec_return variant might end up making the counter
- * negative in the failure case, which may trigger the slowpath
- * for the owner's unlock path - but this is not a big problem
- * in practice. ]
- */
-#ifndef CONFIG_DEBUG_MUTEXES
/***
* mutex_trylock - try acquire the mutex, without waiting
* @lock: the mutex to be acquired
@@ -248,24 +235,13 @@ __mutex_lock_interruptible_nonatomic(str
*/
int fastcall mutex_trylock(struct mutex *lock)
{
-#ifdef __HAVE_ARCH_CMPXCHG
- if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
- return 1;
+#ifndef CONFIG_DEBUG_MUTEXES
+ return __mutex_trylock(&lock->count);
#else
- if (atomic_dec_return(&lock->count) == 0)
- return 1;
-#endif
- return 0;
-}
-
-#else /* CONFIG_DEBUG_MUTEXES: */
-
-/*
- * In the debug case we take the spinlock and check whether we can
- * get the lock:
- */
-int fastcall mutex_trylock(struct mutex *lock)
-{
+ /*
+ * In the debug case we take the spinlock and check whether we can
+ * get the lock:
+ */
struct thread_info *ti = current_thread_info();
int ret = 0;
@@ -280,10 +256,9 @@ int fastcall mutex_trylock(struct mutex
spin_unlock_mutex(&lock->wait_lock);
return ret;
+#endif
}
-#endif /* CONFIG_DEBUG_MUTEXES */
-
/*
* Release the lock, slowpath:
*/
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -98,5 +98,31 @@ do { \
*/
#define __mutex_slowpath_needs_to_unlock() 1
+/*
+ * For __mutex_trylock we use another construct which could be described
+ * as an "incomplete atomic decrement" or a "single value cmpxchg" since
+ * it has two modes of failure:
+ *
+ * 1) if the exclusive store fails we fail, and
+ *
+ * 2) if the decremented value is not zero we don't even attempt the store.
+ *
+ * This provides the needed trylock semantics like cmpxchg would, but it is
+ * lighter and less generic than a true cmpxchg implementation.
+ */
+static inline int __mutex_trylock(atomic_t *count)
+{
+ int __ex_flag, __res;
+ __asm__ (
+ "ldrex %0, [%2]\n\t"
+ "subs %0, %0, #1\n\t"
+ "strexeq %1, %0, [%2]"
+ : "=&r" (__res), "=&r" (__ex_flag)
+ : "r" (&count->counter)
+ : "cc","memory" );
+ __res |= __ex_flag;
+ return __res == 0;
+}
+
#endif
#endif
Index: linux-2.6/include/asm-generic/mutex-dec.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-dec.h
+++ linux-2.6/include/asm-generic/mutex-dec.h
@@ -63,4 +63,40 @@ do { \
#define __mutex_slowpath_needs_to_unlock() 1
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+ /*
+ * We have two variants here. The cmpxchg based one is the best one
+ * because it never induces a false contention state. It is included
+ * here because architectures using the inc/dec algorithms over the
+ * xchg ones are much more likely to support cmpxchg natively.
+ * If not we fall back to the atomic_dec_return variant.
+ *
+ * The atomic_dec_return variant might end up making the counter
+ * negative in the failure case, which may trigger the slowpath
+ * for the owner's unlock path - but this is not a big problem
+ * in practice.
+ */
+#ifdef __HAVE_ARCH_CMPXCHG
+ if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+ return 1;
+#else
+ if (likely(atomic_dec_return(count) == 0))
+ return 1;
+#endif
+ return 0;
+}
+
#endif
Some architectures, notably ARM for instance, might benefit from inlining
the mutex fast paths. This patch allows that when debugging is disabled;
when debugging is enabled, inlining is unconditionally turned off.
Signed-off-by: Nicolas Pitre <[email protected]>
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -8,6 +8,12 @@
#ifndef _ASM_MUTEX_H
#define _ASM_MUTEX_H
+/* On ARM it is best to inline the mutex fast paths, unless swp is broken. */
+#include <asm/system.h>
+#ifndef swp_is_buggy
+# define MUTEX_INLINE_FASTPATH
+#endif
+
#if __LINUX_ARM_ARCH__ < 6
/* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
# include <asm-generic/mutex-xchg.h>
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -16,16 +16,10 @@
#include <linux/spinlock.h>
#include <linux/interrupt.h>
-/*
- * In the DEBUG case we are using the "NULL fastpath" for mutexes,
- * which forces all calls into the slowpath:
- */
#ifdef CONFIG_DEBUG_MUTEXES
# include "mutex-debug.h"
-# include <asm-generic/mutex-null.h>
#else
# include "mutex.h"
-# include <asm/mutex.h>
#endif
/***
@@ -219,6 +213,7 @@ __mutex_lock_interruptible_nonatomic(str
return -EINTR;
}
+#ifndef MUTEX_INLINE_FASTPATH
/***
* mutex_trylock - try acquire the mutex, without waiting
* @lock: the mutex to be acquired
@@ -258,6 +253,7 @@ int fastcall mutex_trylock(struct mutex
return ret;
#endif
}
+#endif
/*
* Release the lock, slowpath:
@@ -293,8 +289,16 @@ static inline void __mutex_unlock_nonato
* We want the atomic ops to come first in the kernel image, to make
* sure the branch is predicted by the CPU as default-untaken:
*/
-static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count
- __IP_DECL__));
+
+#ifndef MUTEX_INLINE_FASTPATH
+/* if we don't inline fast paths, then make those static */
+static void fastcall __sched
+__mutex_lock_noinline(atomic_t *lock_count __IP_DECL__);
+static int fastcall __sched
+__mutex_lock_interruptible_noinline(atomic_t *lock_count __IP_DECL__);
+static void fastcall __sched
+__mutex_unlock_noinline(atomic_t *lock_count __IP_DECL__);
+#endif
/*
* The locking fastpath is the 1->0 transition from 'unlocked' into
@@ -305,7 +309,7 @@ static inline void __mutex_lock_atomic(s
__mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
}
-static void fastcall __sched
+void fastcall __sched
__mutex_lock_noinline(atomic_t *lock_count __IP_DECL__)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);
@@ -318,16 +322,13 @@ static inline void __mutex_lock(struct m
__mutex_lock_atomic(lock __IP__);
}
-static int fastcall __sched
-__mutex_lock_interruptible_noinline(atomic_t *lock_count __IP_DECL__);
-
static inline int __mutex_lock_interruptible(struct mutex *lock __IP_DECL__)
{
return __mutex_fastpath_lock_retval
(&lock->count, __mutex_lock_interruptible_noinline);
}
-static int fastcall __sched
+int fastcall __sched
__mutex_lock_interruptible_noinline(atomic_t *lock_count __IP_DECL__)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);
@@ -335,9 +336,6 @@ __mutex_lock_interruptible_noinline(atom
return __mutex_lock_interruptible_nonatomic(lock __IP__);
}
-static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count
- __IP_DECL__));
-
/*
* The unlocking fastpath is the 0->1 transition from 'locked' into
* 'unlocked' state:
@@ -347,8 +345,8 @@ static inline void __mutex_unlock_atomic
__mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
}
-static void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count
- __IP_DECL__)
+void fastcall __sched
+__mutex_unlock_noinline(atomic_t *lock_count __IP_DECL__)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);
@@ -360,6 +358,8 @@ static inline void __mutex_unlock(struct
__mutex_unlock_atomic(lock __IP__);
}
+#ifndef MUTEX_INLINE_FASTPATH
+
/***
* mutex_lock - acquire the mutex
* @lock: the mutex to be acquired
@@ -422,6 +422,8 @@ EXPORT_SYMBOL(mutex_lock);
EXPORT_SYMBOL(mutex_unlock);
EXPORT_SYMBOL(mutex_lock_interruptible);
+#endif
+
/***
* mutex_init - initialize the mutex
* @lock: the mutex to be initialized
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -92,10 +92,53 @@ struct mutex_waiter {
extern void FASTCALL(__mutex_init(struct mutex *lock, const char *name));
+/*
+ * In the DEBUG case we are using the "NULL fastpath" for mutexes,
+ * which forces all calls into the slowpath. We also unconditionally disable
+ * any fastpath inlining.
+ */
+#ifdef CONFIG_DEBUG_MUTEXES
+# include <asm-generic/mutex-null.h>
+# undef MUTEX_INLINE_FASTPATH
+#else
+# include <asm/mutex.h>
+#endif
+
+#ifdef MUTEX_INLINE_FASTPATH
+
+#include <linux/sched.h>
+extern void fastcall __sched __mutex_lock_noinline(atomic_t *lock_count);
+extern int fastcall __sched __mutex_lock_interruptible_noinline(atomic_t *lock_count);
+extern void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count);
+
+static inline int mutex_trylock(struct mutex *lock)
+{
+ return __mutex_trylock(&lock->count);
+}
+
+static inline void mutex_lock(struct mutex *lock)
+{
+ __mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
+}
+
+static inline int mutex_lock_interruptible(struct mutex *lock)
+{
+ return __mutex_fastpath_lock_retval
+ (&lock->count, __mutex_lock_interruptible_noinline);
+}
+
+static inline void mutex_unlock(struct mutex *lock)
+{
+ __mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
+}
+
+#else
extern void FASTCALL(mutex_lock(struct mutex *lock));
extern int FASTCALL(mutex_lock_interruptible(struct mutex *lock));
extern int FASTCALL(mutex_trylock(struct mutex *lock));
extern void FASTCALL(mutex_unlock(struct mutex *lock));
+#endif
+
extern int FASTCALL(mutex_is_locked(struct mutex *lock));
#endif
There is currently no advantage to not always inlining mutex_is_locked,
even on x86.
Signed-off-by: Nicolas Pitre <[email protected]>
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -22,19 +22,6 @@
# include "mutex.h"
#endif
-/***
- * mutex_is_locked - is the mutex locked
- * @lock: the mutex to be queried
- *
- * Returns 1 if the mutex is locked, 0 if unlocked.
- */
-int fastcall mutex_is_locked(struct mutex *lock)
-{
- return atomic_read(&lock->count) != 1;
-}
-
-EXPORT_SYMBOL(mutex_is_locked);
-
/*
* Block on a lock - add ourselves to the list of waiters.
* Called with lock->wait_lock held.
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -139,6 +139,9 @@ extern int FASTCALL(mutex_trylock(struct
extern void FASTCALL(mutex_unlock(struct mutex *lock));
#endif
-extern int FASTCALL(mutex_is_locked(struct mutex *lock));
+static inline int fastcall mutex_is_locked(struct mutex *lock)
+{
+ return atomic_read(&lock->count) != 1;
+}
#endif
* Nicolas Pitre <[email protected]> wrote:
> There is currently no advantage to not always inlining
> mutex_is_locked, even on x86.
thanks, applied.
Ingo
* Nicolas Pitre <[email protected]> wrote:
> In the spirit of uniformity, this patch provides architecture specific
> trylock implementations. This allows for a pure xchg-based
> implementation, as well as an optimized ARMv6 implementation.
hm, i don't really like the xchg variant:
> +static inline int
> +__mutex_trylock(atomic_t *count)
> +{
> + int prev = atomic_xchg(count, 0);
> +
> + if (unlikely(prev < 0)) {
> + /*
> + * The lock was marked contended so we must restore that
> + * state. If while doing so we get back a prev value of 1
> + * then we just own it.
> + *
> + * In all cases this has the potential to trigger the
> + * slowpath for the owner's unlock path - but this is not
> + * a big problem in practice.
> + */
> + prev = atomic_xchg(count, -1);
> + if (prev < 0)
> + prev = 0;
> + }
here we go to great trouble trying to avoid the 'slowpath', while we
unconditionally force the next unlock into the slowpath! So we have not
won anything. (on a cycle count basis it's probably even a net loss)
The same applies to the atomic_dec_return() based trylock variant. Only the
cmpxchg based one, and the optimized ARM variant should be used as a
'fastpath'.
another thing is that this solution still leaves that ugly #ifdef in
kernel/mutex.c.
so i took a different solution that solves both problems. The API
towards architectures is now:
static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fn)(atomic_t *))
note the new 'fn' parameter: this enables mutex-null.h to do a simple:
#define __mutex_fastpath_trylock(count, fn_name) fn_name(count)
and the final ugly debugging related #ifdef is gone from kernel/mutex.c!
Furthermore, both the mutex-xchg.h and the mutex-dec.h non-cmpxchg
variants will fall back to the spinlock implementation.
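As an illustration of the resulting core code (a sketch only, not the
literal patch; __mutex_trylock_slowpath is a placeholder name for the
spinlock-based implementation that the debug code uses), kernel/mutex.c
can then do the same thing in both the debug and non-debug builds:

/* spinlock-based trylock, also reached via mutex-null.h when debugging: */
static int fastcall __mutex_trylock_slowpath(atomic_t *lock_count);

int fastcall mutex_trylock(struct mutex *lock)
{
	return __mutex_fastpath_trylock(&lock->count,
					__mutex_trylock_slowpath);
}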
Ingo
* Nicolas Pitre <[email protected]> wrote:
> Some architectures, notably ARM for instance, might benefit from
> inlining the mutex fast paths. [...]
what is the effect on text size? Could you post the before- and
after-patch vmlinux 'size kernel/test.o' output in the nondebug case,
with Arjan's latest 'convert a couple of semaphore users to mutexes'
patch applied? [make sure you've got enough of those users compiled in,
so that the inlining cost is truly measured. Perhaps also do
before/after 'size' output of a few affected .o files, without mixing
kernel/mutex.o into it, like vmlinux does.]
Ingo
On Mon, 2005-12-26 at 14:25 -0500, Nicolas Pitre wrote:
> Index: linux-2.6/include/asm-arm/mutex.h
> ===================================================================
> --- linux-2.6.orig/include/asm-arm/mutex.h
> +++ linux-2.6/include/asm-arm/mutex.h
> @@ -98,5 +98,31 @@ do { \
> */
> #define __mutex_slowpath_needs_to_unlock() 1
>
> +/*
> + * For __mutex_trylock we use another construct which could be described
> + * as an "incomplete atomic decrement" or a "single value cmpxchg" since
> + * it has two modes of failure:
> + *
> + * 1) if the exclusive store fails we fail, and
> + *
> + * 2) if the decremented value is not zero we don't even attempt the store.
btw I really think that 1) is wrong. trylock should do everything it can
to get the semaphore short of sleeping. Just because some cacheline got
written to (which might even be shared!) in the middle of the atomic op
is not a good enough reason to fail the trylock imho. Going into the
slowpath.. fine. But here it's a quality of implementation issue; you
COULD get the semaphore without sleeping (at least probably, you'd have
to retry to know for sure) but because something wrote to the same
cacheline as the lock... no. that's just not good enough.. sorry.
* Arjan van de Ven <[email protected]> wrote:
> > + * 1) if the exclusive store fails we fail, and
> > + *
> > + * 2) if the decremented value is not zero we don't even attempt the store.
>
>
> btw I really think that 1) is wrong. trylock should do everything it
> can to get the semaphore short of sleeping. Just because some
> cacheline got written to (which might even be shared!) in the middle
> of the atomic op is not a good enough reason to fail the trylock imho.
> Going into the slowpath.. fine. But here it's a quality of
> implementation issue; you COULD get the semaphore without sleeping (at
> least probably, you'd have to retry to know for sure) but because
> something wrote to the same cacheline as the lock... no. that's just
> not good enough.. sorry.
point. I solved this in my tree by calling the generic trylock <fn> if
there's an __ex_flag failure in the ARMv6 case. Should be rare (and thus
the call is under unlikely()), and should thus still enable the fast
implementation.
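Presumably that addition looks roughly like the hunk which the follow-up
patch further down removes again, i.e.:

	/*
	 * We must not return a synthetic 'failure' if the conditional
	 * did not succeed - drop back into the generic slowpath if
	 * this happens (should be rare):
	 */
	if (unlikely(__ex_flag))
		return fn_name(count);

	return __res == 0;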
Ingo
On Tue, 27 Dec 2005, Ingo Molnar wrote:
>
> * Nicolas Pitre <[email protected]> wrote:
>
> > In the spirit of uniformity, this patch provides architecture specific
> > trylock implementations. This allows for a pure xchg-based
> > implementation, as well as an optimized ARMv6 implementation.
>
> hm, i dont really like the xchg variant:
>
> > +static inline int
> > +__mutex_trylock(atomic_t *count)
> > +{
> > + int prev = atomic_xchg(count, 0);
> > +
> > + if (unlikely(prev < 0)) {
> > + /*
> > + * The lock was marked contended so we must restore that
> > + * state. If while doing so we get back a prev value of 1
> > + * then we just own it.
> > + *
> > + * In all cases this has the potential to trigger the
> > + * slowpath for the owner's unlock path - but this is not
> > + * a big problem in practice.
> > + */
> > + prev = atomic_xchg(count, -1);
> > + if (prev < 0)
> > + prev = 0;
> > + }
>
> here we go to great trouble trying to avoid the 'slowpath', while we
> unconditionally force the next unlock into the slowpath! So we have not
> won anything. (on a cycle count basis it's probably even a net loss)
I disagree. There is no unconditional forcing into the slow path
at all. Please consider again.
1) We try to lock with 0.
If the previous value was 1 (likely) or 0 (less likely) we return
that value right away. We therefore cover 2 out of 3 cases already
with no more instructions or cycles than the successful fastpath
lock. Note the if (unlikely(prev < 0)) above that handles those two
cases with one test, and if it gets inlined then the outer code
testing for the return value would rely on the same test to decide
what to do since the condition flags are already set (gcc apparently
gets it right on ARM at least). This is the likely case solved.
2) If the previous value was neither 1 nor 0, then the mutex was already
contended. We agree that "already locked with contention" is the
least likely of all 3 cases, right? So it is already unlikely
that this case will be considered. Still, since in this case we changed
the lock from contended to simply locked, we have to
mark it contended again, otherwise we risk missing a wakeup of the
waiters. But if by chance (and you'd have to be extremely lucky
since the window is really small) the value changed from
0 to 1 (remember we set it to 0 above) before we swap it back to -1,
then we simply own the lock. If it was still 0 then we simply set it
back to contended as it was previously. And if it became -1 then we
didn't change anything by setting it to -1. But in all cases, whether
the owner unlocked it (from 0 to 1), or even someone else came along
to lock it (from 0 to 1 to 0), we know that there is/are sleeping
waiters that the previous owner failed to wake up because we made the
value 0 for a while. So whether we are the new owner, or someone else is,
there is still the need for waking up a waiter upon unlock
regardless.
So, as you can see, my xchg-based trylock doesn't force anything in
99.999% of the cases. The only possibility for a false contention state
would involve:
- process A locking the mutex (set to 0)
- process B locking the mutex and failing (now set to -1)
- process C attempting a trylock (new 0, prev = -1)
- process A unlocking (from 0 to 1 -> no waking process B)
- process D locking (from 1 to 0)
- process E locking and failing (from 0 to -1)
- process D unlocking (it is -1 so wake up process B, leave it to -1)
- process B unlocking (it is -1 so wake up process E, set to 0)
- process E unlock (from 0 to 1)
- process C, midway into trylock, restores the contention flag (from 1 to -1)
- when process C unlocks, it branches to the slow path needlessly.
Do you really think we should care about such a twisted scenario?
> The same applies to the atomic_dec_return() based trylock variant. Only the
> cmpxchg based one, and the optimized ARM variant should be used as a
> 'fastpath'.
On ARM prior to version 6, a cmpxchg is what we need to avoid as much as
possible, for the same reason I've been advocating for xchg. On ARM
prior to version 6, trying to perform a cmpxchg is as costly as an
atomic_dec, and they are both more costly and bigger than my xchg-based
trylock algorithm. And contrary to the atomic_dec-based trylock, the
xchg-based one does not unconditionally force any unlock into the slow
path.
> Furthermore, both the mutex-xchg.h and the mutex-dec.h non-cmpxchg
> variants will fall back to the spinlock implementation.
Given the above demonstration I really think you should use my
xchg-based trylock in mutex-xchg.h. At least on ARM it is _still_
cheaper and smaller than a preemption disable implied by a spinlock.
Nicolas
On Tue, 27 Dec 2005, Ingo Molnar wrote:
>
> * Nicolas Pitre <[email protected]> wrote:
>
> > Some architectures, notably ARM for instance, might benefit from
> > inlining the mutex fast paths. [...]
>
> what is the effect on text size? Could you post the before- and
> after-patch vmlinux 'size kernel/test.o' output in the nondebug case,
> with Arjan's latest 'convert a couple of semaphore users to mutexes'
> patch applied? [make sure you've got enough of those users compiled in,
> so that the inlining cost is truly measured. Perhaps also do
> before/after 'size' output of a few affected .o files, without mixing
> kernel/mutex.o into it, like vmlinux does.]
Theory should be convincing enough. First of all, all the semaphore
fast paths are always inlined currently, on all architectures I've
looked at. A down() fast path is always looking like this:
mrs ip, cpsr
orr lr, ip, #128
msr cpsr_c, lr
ldr lr, [r0]
subs lr, lr, #1
str lr, [r0]
msr cpsr_c, ip
movmi ip, r0
blmi __down_failed
So our starting point for comparison is 9 instructions for every down()
occurrence in the kernel. Same thing for up(). Every instruction is
invariably 4 bytes.
Now let's look at the typical mutex_lock():
mov r4, #0
swp r3, r4, [r0]
cmp r3, #1
blne __mutex_lock_noinline
This is 4 instructions. Furthermore, the first "mov r4, #0" can often
be eliminated when gcc can CSE the constant 0 from another
register. We're talking about 3 instructions then, down from 9!
We therefore save between 20 and 24 bytes of kernel .text for every
down() and every up() simply by going with mutexes.
Now if mutex_lock and mutex_unlock were not inlined, the above 3 or
4 instructions would become one or two per call site, which is still a
gain in space, though not as large as the one provided by the move
from semaphores to mutexes. It would, however, be more costly in terms of
cycles, since a function prologue and epilogue are somewhat costly on ARM,
especially with frame pointers enabled (I'll let RMK elaborate on his
reasons for not disabling them).
And for mutex_lock_interruptible(), the inlined fastpath is not bigger
than the non-inlined one, considering that the return value has to be
tested (the test is done twice in the non-inlined case: once inside the
function, and once outside of it) while the inlined version needs only
one test. They are therefore equivalent in terms of space.
Nicolas
* Nicolas Pitre <[email protected]> wrote:
> > * Nicolas Pitre <[email protected]> wrote:
> >
> > > Some architectures, notably ARM for instance, might benefit from
> > > inlining the mutex fast paths. [...]
> >
> > what is the effect on text size? Could you post the before- and
> > after-patch vmlinux 'size kernel/test.o' output in the nondebug case,
> > with Arjan's latest 'convert a couple of semaphore users to mutexes'
> > patch applied? [make sure you've got enough of those users compiled in,
> > so that the inlining cost is truly measured. Perhaps also do
> > before/after 'size' output of a few affected .o files, without mixing
> > kernel/mutex.o into it, like vmlinux does.]
>
> Theory should be convincing enough. [...]
please provide actual measurements (just a simple pre-patch and
post-patch 'size' output of vmlinux is enough), so that we can see the
inlining cost. Note that x86 went to a non-inlined fastpath _despite_
having a compact CISC semaphore fastpath.
Ingo
* Nicolas Pitre <[email protected]> wrote:
> > here we go to great trouble trying to avoid the 'slowpath', while we
> > unconditionally force the next unlock into the slowpath! So we have
> > not won anything. (on a cycle count basis it's probably even a net
> > loss)
>
> I disagree. [...elaborate analysis of the code ...]
you are right, it should work fine, and should be optimal. I'll add your
xchg variant to mutex-xchg.h.
Ingo
* Ingo Molnar <[email protected]> wrote:
> * Nicolas Pitre <[email protected]> wrote:
>
> > > here we go to great trouble trying to avoid the 'slowpath', while we
> > > unconditionally force the next unlock into the slowpath! So we have
> > > not won anything. (on a cycle count basis it's probably even a net
> > > loss)
> >
> > I disagree. [...elaborate analysis of the code ...]
>
> you are right, it should work fine, and should be optimal. I'll add
> your xchg variant to mutex-xchg.h.
the patch below adds it, and it boots fine on x86 with mutex.c hacked to
include asm-generic/mutex-xchg.h.
Ingo
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -82,7 +82,25 @@ do { \
static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fn)(atomic_t *))
{
- return fn(count);
+ int prev = atomic_xchg(count, 0);
+
+ if (unlikely(prev < 0)) {
+ /*
+ * The lock was marked contended so we must restore that
+ * state. If while doing so we get back a prev value of 1
+ * then we just own it.
+ *
+ * [ In the rare case of the mutex going to 1 and then to 0
+ * in this few-instructions window, this has the potential
+ * to trigger the slowpath for the owner's unlock path, but
+ * that's not a problem in practice. ]
+ */
+ prev = atomic_xchg(count, -1);
+ if (prev < 0)
+ prev = 0;
+ }
+
+ return prev;
}
#endif
On Wed, 28 Dec 2005, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Nicolas Pitre <[email protected]> wrote:
> >
> > > > here we go to great trouble trying to avoid the 'slowpath', while we
> > > > unconditionally force the next unlock into the slowpath! So we have
> > > > not won anything. (on a cycle count basis it's probably even a net
> > > > loss)
> > >
> > > I disagree. [...elaborate analysis of the code ...]
> >
> > you are right, it should work fine, and should be optimal. I'll add
> > your xchg variant to mutex-xchg.h.
>
> the patch below adds it, and it boots fine on x86 with mutex.c hacked to
> include asm-generic/mutex-xchg.h.
Here's an additional patch to fix some comments, and to add a small
optimization.
Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -75,9 +75,14 @@ do { \
* @count: pointer of type atomic_t
* @fn: spinlock based trylock implementation
*
- * We call the spinlock based generic variant, because atomic_xchg() is
- * too destructive to provide a simpler trylock implementation than the
- * spinlock based one.
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it to 0 on failure.
+ *
+ * If the architecture has no effective trylock variant, it should call the
+ * <fn> spinlock-based trylock variant unconditionally.
*/
static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fn)(atomic_t *))
@@ -90,12 +95,13 @@ __mutex_fastpath_trylock(atomic_t *count
* state. If while doing so we get back a prev value of 1
* then we just own it.
*
- * [ In the rare case of the mutex going to 1 and then to 0
- * in this few-instructions window, this has the potential
- * to trigger the slowpath for the owner's unlock path, but
- * that's not a problem in practice. ]
+ * [ In the rare case of the mutex going to 1, to 0, to -1
+ * and then back to 0 in this few-instructions window,
+ * this has the potential to trigger the slowpath for the
+ * owner's unlock path needlessly, but that's not a problem
+ * in practice. ]
*/
- prev = atomic_xchg(count, -1);
+ prev = atomic_xchg(count, prev);
if (prev < 0)
prev = 0;
}
* Nicolas Pitre <[email protected]> wrote:
> > the patch below adds it, and it boots fine on x86 with mutex.c hacked to
> > include asm-generic/mutex-xchg.h.
>
> Here's an additional patch to fix some comments, and to add a small
> optimization.
thanks, applied.
Ingo
On Wed, 28 Dec 2005, Ingo Molnar wrote:
> * Nicolas Pitre <[email protected]> wrote:
>
> > > * Nicolas Pitre <[email protected]> wrote:
> > >
> > > > Some architectures, notably ARM for instance, might benefit from
> > > > inlining the mutex fast paths. [...]
> > >
> > > what is the effect on text size? Could you post the before- and
> > > after-patch vmlinux 'size kernel/test.o' output in the nondebug case,
> > > with Arjan's latest 'convert a couple of semaphore users to mutexes'
> > > patch applied? [make sure you've got enough of those users compiled in,
> > > so that the inlining cost is truly measured. Perhaps also do
> > > before/after 'size' output of a few affected .o files, without mixing
> > > kernel/mutex.o into it, like vmlinux does.]
> >
> > Theory should be convincing enough. [...]
>
> please provide actual measurements (just a simple pre-patch and
> post-patch 'size' output of vmlinux is enough), so that we can see the
> inlining cost.
This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n,
therefore using the current semaphore code:
text data bss dec hex filename
1821108 287792 88264 2197164 2186ac vmlinux
Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with
mutexes:
text data bss dec hex filename
1797108 287568 88172 2172848 2127b0 vmlinux
Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:
text data bss dec hex filename
1807824 287136 88172 2183132 214fdc vmlinux
This last case is not the smallest, but it is the fastest.
> Note that x86 went to a non-inlined fastpath _despite_
> having a compact CISC semaphore fastpath.
The function call overhead on x86 is less significant than the ARM one,
so always calling out of line code might be sensible in that case.
Nicolas
On Tue, 27 Dec 2005, Arjan van de Ven wrote:
> btw I really think that 1) is wrong. trylock should do everything it can
> to get the semaphore short of sleeping. Just because some cacheline got
> written to (which might even be shared!) in the middle of the atomic op
> is not a good enough reason to fail the trylock imho. Going into the
> slowpath.. fine. But here it's a quality of implementation issue; you
> COULD get the semaphore without sleeping (at least probably, you'd have
> to retry to know for sure) but because something wrote to the same
> cacheline as the lock... no. that's just not good enough.. sorry.
Well... actually it is not clear from the documentation how the
exclusive monitor tags accesses (there is a lot of implementation-specific
latitude). So you are right that it could cause too many false negative
results.
Nicolas
On Tue, 27 Dec 2005, Ingo Molnar wrote:
>
> * Arjan van de Ven <[email protected]> wrote:
>
> > > + * 1) if the exclusive store fails we fail, and
> > > + *
> > > + * 2) if the decremented value is not zero we don't even attempt the store.
> >
> >
> > btw I really think that 1) is wrong. trylock should do everything it
> > can to get the semaphore short of sleeping. Just because some
> > cacheline got written to (which might even be shared!) in the middle
> > of the atomic op is not a good enough reason to fail the trylock imho.
> > Going into the slowpath.. fine. But here it's a quality of
> > implementation issue; you COULD get the semaphore without sleeping (at
> > least probably, you'd have to retry to know for sure) but because
> > something wrote to the same cacheline as the lock... no. that's just
> > not good enough.. sorry.
>
> point. I solved this in my tree by calling the generic trylock <fn> if
> there's an __ex_flag failure in the ARMv6 case. Should be rare (and thus
> the call is under unlikely()), and should thus still enable the fast
> implementation.
I'd solve it like this instead (on top of your latest patches):
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -110,12 +110,7 @@ do { \
/*
* For __mutex_fastpath_trylock we use another construct which could be
- * described as an "incomplete atomic decrement" or a "single value cmpxchg"
- * since it has two modes of failure:
- *
- * 1) if the exclusive store fails we fail, and
- *
- * 2) if the decremented value is not zero we don't even attempt the store.
+ * described as a "single value cmpxchg".
*
* This provides the needed trylock semantics like cmpxchg would, but it is
* lighter and less generic than a true cmpxchg implementation.
@@ -123,27 +118,22 @@ do { \
static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fn_name)(atomic_t *))
{
- int __ex_flag, __res;
+ int __ex_flag, __res, __orig;
__asm__ (
- "ldrex %0, [%2] \n"
- "subs %0, %0, #1 \n"
- "strexeq %1, %0, [%2] \n"
+ "1: ldrex %0, [%3] \n"
+ "subs %1, %0, #1 \n"
+ "strexeq %2, %1, [%3] \n"
+ "movlt %0, #0 \n"
+ "cmpeq %2, #0 \n"
+ "bgt 1b \n"
- : "=&r" (__res), "=&r" (__ex_flag)
+ : "=&r" (__orig), "=&r" (__res), "=&r" (__ex_flag)
: "r" (&count->counter)
: "cc", "memory" );
- /*
- * We must not return a synthetic 'failure' if the conditional
- * did not succeed - drop back into the generic slowpath if
- * this happens (should be rare):
- */
- if (unlikely(__ex_flag))
- return fn_name(count);
-
- return __res == 0;
+ return __orig;
}
#endif
Nicolas
* Nicolas Pitre <[email protected]> wrote:
> > > > + * 1) if the exclusive store fails we fail, and
> > > > + *
> > > > + * 2) if the decremented value is not zero we don't even attempt the store.
> > >
> > >
> > > btw I really think that 1) is wrong. trylock should do everything it
> > > can to get the semaphore short of sleeping. Just because some
> > > cacheline got written to (which might even be shared!) in the middle
> > > of the atomic op is not a good enough reason to fail the trylock imho.
> > > Going into the slowpath.. fine. But here it's a quality of
> > > implementation issue; you COULD get the semaphore without sleeping (at
> > > least probably, you'd have to retry to know for sure) but because
> > > something wrote to the same cacheline as the lock... no. that's just
> > > not good enough.. sorry.
> >
> > point. I solved this in my tree by calling the generic trylock <fn> if
> > there's an __ex_flag failure in the ARMv6 case. Should be rare (and thus
> > the call is under unlikely()), and should thus still enable the fast
> > implementation.
>
> I'd solve it like this instead (on top of your latest patches):
thanks, applied.
> + "1: ldrex %0, [%3] \n"
> + "subs %1, %0, #1 \n"
> + "strexeq %2, %1, [%3] \n"
> + "movlt %0, #0 \n"
> + "cmpeq %2, #0 \n"
> + "bgt 1b \n"
so we are back to what is in essence a cmpxchg implementation?
Ingo
* Nicolas Pitre <[email protected]> wrote:
> This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n,
> therefore using the current semaphore code:
>
> text data bss dec hex filename
> 1821108 287792 88264 2197164 2186ac vmlinux
>
> Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with
> mutexes:
>
> text data bss dec hex filename
> 1797108 287568 88172 2172848 2127b0 vmlinux
>
> Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:
>
> text data bss dec hex filename
> 1807824 287136 88172 2183132 214fdc vmlinux
>
> This last case is not the smallest, but it is the fastest.
i.e. 1.3% text savings from going to mutexes, and inlining them again
gives up 0.5% of that. We've uninlined stuff for a smaller gain in the
past ...
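(Working from the numbers above: 1821108 - 1797108 = 24000 bytes of text
saved by the switch to mutexes, roughly 1.3%, while inlining the fastpaths
adds back 1807824 - 1797108 = 10716 bytes, roughly 0.5-0.6%.)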
> > Note that x86 went to a non-inlined fastpath _despite_
> > having a compact CISC semaphore fastpath.
>
> The function call overhead on x86 is less significant than the ARM
> one, so always calling out of line code might be sensible in that
> case.
i'm highly doubtful we should do that. The spinlock APIs are 4 times
more frequent than mutexes are ever going to be, still they too are
mostly out of line. (and we only inline the unlock portions that are a
space win!) Can you measure any significant difference in performance?
(e.g. lat_pipe triggers the mutex fastpath, in DEBUG_MUTEX_FULL=y mode)
the performance won by inlining is often offset by the cost of the
higher icache footprint. (and ARM CPUs don't have particularly large
caches to begin with)
Ingo
Ingo Molnar wrote:
> * Nicolas Pitre <[email protected]> wrote:
>>+ "1: ldrex %0, [%3] \n"
>>+ "subs %1, %0, #1 \n"
>>+ "strexeq %2, %1, [%3] \n"
>>+ "movlt %0, #0 \n"
>>+ "cmpeq %2, #0 \n"
>>+ "bgt 1b \n"
>
>
> so we are back to what is in essence a cmpxchg implementation?
>
FWIW, I still think we should go for an open-coded "cmpxchg" variant
for UP that disables preempt, and an atomic_cmpxchg variant for SMP.
- good generic implementations
- the UP version is faster than atomic_xchg for non preempt on ARM
- if you really are counting cycles, you'd probably have preempt off
- if you've got preempt on then the preempt_ operations in semaphores
would be the least of your worries (how about spinlocks?)
That seems preferable to introducing lots of ugliness and complexity
outright for something that actually slows down the speed-critical
!preempt case (though the difference is unlikely to be measurable
either way). A rough sketch of the idea is below.
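Something along these lines, reusing the existing atomic_* and preempt_*
primitives (illustrative only; the function name, its simplified signature
and the exact #ifdef are placeholders, not a proposed patch):

#ifdef CONFIG_SMP
static inline int __mutex_trylock_sketch(atomic_t *count)
{
	/* SMP: rely on the architecture's atomic_cmpxchg() */
	return atomic_cmpxchg(count, 1, 0) == 1;
}
#else
static inline int __mutex_trylock_sketch(atomic_t *count)
{
	int ret = 0;

	/* UP: open-coded cmpxchg, made atomic by disabling preemption */
	preempt_disable();
	if (atomic_read(count) == 1) {
		atomic_set(count, 0);
		ret = 1;
	}
	preempt_enable();
	return ret;
}
#endif

On UP the preempt_disable()/preempt_enable() pair is what makes the
read-check-store sequence atomic; on SMP the architecture's cmpxchg does
the same job.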
--
SUSE Labs, Novell Inc.
On Thu, 29 Dec 2005, Ingo Molnar wrote:
> > I'd solve it like this instead (on top of your latest patches):
>
> thanks, applied.
>
> > + "1: ldrex %0, [%3] \n"
> > + "subs %1, %0, #1 \n"
> > + "strexeq %2, %1, [%3] \n"
> > + "movlt %0, #0 \n"
> > + "cmpeq %2, #0 \n"
> > + "bgt 1b \n"
>
> so we are back to what is in essence a cmpxchg implementation?
A limited cmpxchg. Using the generic cmpxchg for this would need either
9 or 10 instructions instead of 6.
Nicolas
On Thu, 29 Dec 2005, Nick Piggin wrote:
> FWIW, I still think we should go for an open-coded "cmpxchg" variant
> for UP that disables preempt, and an atomic_cmpxchg variant for SMP.
>
> - good generic implementations
> - the UP version is faster than atomic_xchg for non preempt on ARM
> - if you really are counting cycles, you'd probably have preempt off
> - if you've got preempt on then the preempt_ operations in semaphores
> would be the least of your worries (how about spinlocks?)
>
> Rather than straight out introducing lots of ugliness and complexity
> for something that actually slows down the speed critical !preempt
> case (but is unlikely to be measurable either way).
I provided you with the demonstration last week:
- the non SMP (ARM version < 6) is using xchg.
- xchg on ARM version < 6 is always faster and smaller than any
preemption disable.
- xchg on ARM is the same size as the smallest solution you may think of
when preemption is disabled, and it is never slower (prove me wrong if
you wish).
- all xchg based primitives are "generic" code already.
And I think you didn't look at the overall patch set before talking
about "lots of ugliness", did you? The fact is that the code,
especially the core code, is much cleaner now than it was when
everything was seemingly "generic". The current work on architecture
abstractions still allows optimizations, but in a way that is much
cleaner and more controlled than what happened with the semaphore code,
and the debugging code can even take advantage of it without polluting
the core code.
It happens that i386, x86_64 and ARM (if v6 or above) currently have
their own tweaks to save space and/or cycles, in a pretty strictly
defined way. The effort is currently spent on making sure that if other
architectures want to apply tricks of their own, they can do so without
affecting the core code, which remains 95% of the whole thing (not the
case for semaphores). And the architecture-specific versions provided
so far are _never_ bigger nor slower than any generic version would be.
Otherwise what would be the point?
Nicolas
Nicolas Pitre wrote:
> I provided you with the demonstration last week:
>
I didn't really find it convincing.
> - the non SMP (ARM version < 6) is using xchg.
>
> - xchg on ARM version < 6 is always faster and smaller than any
> preemption disable.
>
Maybe true, but as I said, if you have preemption enabled, then there
are far more preempt_xxx operations in other places than semaphores,
which impact both size and speed.
> - xchg on ARM is the same size as the smallest solution you may think of
> when preemption is disabled, and it is never slower (prove me wrong if
> you wish).
>
I was going from your numbers where you said it was IIRC a cycle faster.
> - all xchg based primitives are "generic" code already.
>
And yet there is a need for architecture-specific code. Also, having
both xchg and cmpxchg variants means that you have two different sets
of possible instruction interleavings, and xchg has much more subtle
corner cases, as you demonstrated.
> And I think you didn't look at the overall patch set before talking
> about "lots of ugliness", did you? The fact is that the code,
Yes I did and I think it is more ugly than my proposal would be.
> especially the core code, is much cleaner now than it was when
> everything was seemingly "generic". The current work on architecture
> abstractions still allows optimizations, but in a way that is much
> cleaner and more controlled than what happened with the semaphore code,
> and the debugging code can even take advantage of it without polluting
> the core code.
>
> It happens that i386, x86_64 and ARM (if v6 or above) currently have
> their own tweaks to save space and/or cycles, in a pretty strictly
> defined way. The effort is currently spent on making sure that if other
> architectures want to apply tricks of their own, they can do so without
> affecting the core code, which remains 95% of the whole thing (not the
> case for semaphores). And the architecture-specific versions provided
> so far are _never_ bigger nor slower than any generic version would be.
> Otherwise what would be the point?
>
I don't think size is a great argument, because the operations should
be out of line anyway to save icache (your numbers showed a pretty
large increase when they're inlined).
And as for speed, I'm not arguing that you can't save a couple of
cycles, but I'm weighing that against the maintainability of a generic
implementation, which has definite advantages, whereas I don't think
you could demonstrate any real speed improvement from slightly faster
semaphore operations on CONFIG_PREEMPT kernels.
If someone ever did find the need for an arch-specific variant that
really offers advantages, then there is nothing to stop that being
added.
--
SUSE Labs, Novell Inc.
Sorry for the delay...
On Thu, 29 Dec 2005, Ingo Molnar wrote:
>
> * Nicolas Pitre <[email protected]> wrote:
>
> > This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n,
> > therefore using the current semaphore code:
> >
> > text data bss dec hex filename
> > 1821108 287792 88264 2197164 2186ac vmlinux
> >
> > Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with
> > mutexes:
> >
> > text data bss dec hex filename
> > 1797108 287568 88172 2172848 2127b0 vmlinux
> >
> > Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:
> >
> > text data bss dec hex filename
> > 1807824 287136 88172 2183132 214fdc vmlinux
> >
> > This last case is not the smallest, but it is the fastest.
>
> i.e. 1.3% text savings from going to mutexes, and inlining them again
> gives up 0.5% of that. We've uninlined stuff for a smaller gain in the
> past ...
>
> > > Note that x86 went to a non-inlined fastpath _despite_
> > > having a compact CISC semaphore fastpath.
> >
> > The function call overhead on x86 is less significant than the ARM
> > one, so always calling out of line code might be sensible in that
> > case.
>
> i'm highly doubtful we should do that. The spinlock APIs are 4 times
> more frequent than mutexes are ever going to be, still they too are
> mostly out of line. (and we only inline the unlock portions that are a
> space win!) Can you measure any significant difference in performance?
> (e.g. lat_pipe triggers the mutex fastpath, in DEBUG_MUTEX_FULL=y mode)
Here it is.
With the default non inlined fast path:
Pipe latency: 14.2669 microseconds
Pipe latency: 14.2625 microseconds
Pipe latency: 14.2655 microseconds
Pipe latency: 14.2670 microseconds
Then, with fast paths inlined:
Pipe latency: 13.9483 microseconds
Pipe latency: 13.9409 microseconds
Pipe latency: 13.9468 microseconds
Pipe latency: 13.9529 microseconds
So inlining the mutex fast path is more than 2% faster for the whole
test.
Is that worth the 0.5% increase in kernel size in your opinion?
Note: I modified lat_pipe to use two threads instead of two processes,
since on ARM switching mm requires a full cache flush due to the VIVT
cache; with two processes the pipe latency skyrockets to around 480
microseconds and the mutex difference is lost in the noise.
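For reference, the spirit of that change can be captured with a small
ping-pong test along these lines (an illustrative sketch, not the actual
lmbench modification; build with -lpthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define ITERATIONS 100000

static int ping[2], pong[2];	/* main -> echo thread, echo thread -> main */

static void *echo(void *unused)
{
	char c;

	/* bounce every byte straight back until the write side is closed */
	while (read(ping[0], &c, 1) == 1)
		if (write(pong[1], &c, 1) != 1)
			break;
	return NULL;
}

int main(void)
{
	pthread_t tid;
	struct timeval t0, t1;
	char c = 'x';
	double usec;
	int i;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}
	pthread_create(&tid, NULL, echo, NULL);

	gettimeofday(&t0, NULL);
	for (i = 0; i < ITERATIONS; i++) {
		/* one round trip per iteration: write out, wait for the reply */
		if (write(ping[1], &c, 1) != 1 || read(pong[0], &c, 1) != 1)
			break;
	}
	gettimeofday(&t1, NULL);

	usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("Pipe round-trip latency: %.4f microseconds\n",
	       usec / ITERATIONS);

	close(ping[1]);		/* makes the echo thread's read() return 0 */
	pthread_join(tid, NULL);
	return 0;
}

Since both ends share one mm, every wakeup is a plain thread switch and
the VIVT cache flush never enters the picture.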
Nicolas