This patchkit implements TSX lock elision for the kernel locks.
Lock elision uses hardware transactional memory to execute
lock regions in parallel without taking the lock.
This is just an RFC at this point, so that people can comment
on the code. Please send your feedback.
Code is against v3.9-rc3
Also available from:
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git
Branch: hle39/spinlock
The branch names may change as the tree is rebased.
For more details on the general elision concept please see:
http://halobates.de/adding-lock-elision-to-linux.pdf
http://lwn.net/Articles/533894/
Full TSX specification:
http://software.intel.com/file/41417 (chapter 8)
The patches provide the elision infrastructure and the changes
to the standard locks (rwsems, mutexes, spinlocks, rwspinlocks,
bit spinlocks) needed to elide them.
The general strategy is to elide as many locks as possible,
and use a combination of manual disabling and automatic
adaptation to handle lock regions that do not elide well.
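For illustration, the basic wrapping pattern looks roughly like this.
It is a simplified sketch of what the elide_lock()/elide_unlock() macros
and the rtm_spin_lock()/rtm_spin_unlock() code in the patches do (using
the _xbegin/_xabort/_xend helpers from the asm/rtm.h added by this
series); it leaves out nesting, interrupt handling, adaptation and
statistics, and the sketch_* names are made up here:

	static inline void sketch_spin_lock(arch_spinlock_t *lock)
	{
		if (_xbegin() == _XBEGIN_STARTED) {
			/*
			 * Lock is free: run the region speculatively.
			 * Reading the lock adds it to the read set, so a
			 * real acquisition by another CPU aborts us.
			 */
			if (!__ticket_spin_is_locked(lock))
				return;
			/* Lock busy: abort and take the fallback path. */
			_xabort(0xff);
		}
		__ticket_spin_lock(lock);	/* normal locking fallback */
	}

	static inline void sketch_spin_unlock(arch_spinlock_t *lock)
	{
		if (!__ticket_spin_is_locked(lock))
			_xend();		/* commit the transaction */
		else
			__ticket_spin_unlock(lock);
	}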
Some additional kernel changes are also useful to fix common
transaction aborts. I have not included those in this patchkit;
they will be submitted separately. Many of these changes
improve general scalability by reducing cache line sharing
overhead.
The adaptation algorithms in particular have a lot of tunables.
The tuning is currently preliminary and will be revised later.
Some questions and answers:
- How much does it improve performance?
I cannot share any performance numbers at this point unfortunately.
Also please keep in mind that the tuning is very preliminary and
will be revised.
- How to test it:
You need a system with Intel TSX, or a qemu version with TSX
support, available from https://github.com/crjohns/qemu-tsx,
which may also be able to run this kernel (untested).
- The CONFIG_RTM_LOCKS option does not appear
Make sure CONFIG_PARAVIRT_GUEST and CONFIG_PARAVIRT_SPINLOCKS
are enabled. The spinlock code uses the paravirt locking infrastructure
to add elision.
- How does it interact with virtualization?
It cannot interoperate with Xen paravirtualized locks, but without
them lock elision should work under virtualization. If the Xen
pvlocks are active, spinlock elision will be disabled.
This may be fixed at some point.
There are some limitations in perf TSX PMU profiling with virtualization.
- How to tune it:
Use perf with the TSX extensions and the statistics exposed in
/sys/module/rtm_locks
You may need the latest hsw/pmu* branch from
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git
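For example (the perf invocations and the lock_el_skip statistic are
described in the lock-elision documentation added later in this series;
the exact event names depend on the unmerged perf patches):
	perf stat -T <workload>
	perf record -e tx-aborts <workload>
	cat /sys/module/rtm_locks/parameters/lock_el_skip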
- Why does this use RTM and not HLE?
RTM is more flexible and we don't need HLE in this code.
Andi Kleen
[email protected]
Speaking for myself only
From: Andi Kleen <[email protected]>
Add generic macros to disable transactions per process.
Without TSX (or other HTM) support this is a noop.
An RTM enabled x86 kernel uses its own version.
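For reference, an RTM-enabled implementation could look like the sketch
below. This is purely illustrative: the per-task counter field (here
called "notxn") is made up for the example and is not necessarily how
the x86 patch in this series does it.

	/* Hypothetical sketch: per-task transaction-disable nesting count */
	#define disable_txn()	(current_thread_info()->notxn++)
	#define reenable_txn()	(current_thread_info()->notxn--)
	#define txn_disabled()	(current_thread_info()->notxn > 0)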
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/thread_info.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index e7e0473..5a4bec0 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -148,6 +148,12 @@ static inline bool test_and_clear_restore_sigmask(void)
#error "no set_restore_sigmask() provided and default one won't work"
#endif
+#ifndef CONFIG_RTM_LOCKS
+#define disable_txn() do {} while (0)
+#define reenable_txn() do {} while (0)
+#define txn_disabled() 0
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_THREAD_INFO_H */
--
1.7.7.6
From: Andi Kleen <[email protected]>
The TSX RTM lock elision code needs to distinguish spin unlocks that reenable
interrupts from others. Currently this is hidden in the higher level spinlock
code. Add support for calling an architecture specific arch_spin_unlock_flags/irq
if available.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/spinlock.h | 41 ++++++++++++++++++++++++++++++++++++++
include/linux/spinlock_api_smp.h | 6 +---
2 files changed, 43 insertions(+), 4 deletions(-)
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..c6c531a 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -128,11 +128,38 @@ static inline void smp_mb__after_lock(void) { smp_mb(); }
*/
#define raw_spin_unlock_wait(lock) arch_spin_unlock_wait(&(lock)->raw_lock)
+#ifndef ARCH_HAS_SPIN_UNLOCK_IRQ
+static inline void arch_spin_unlock_flags(arch_spinlock_t *lock, unsigned long flags)
+{
+ arch_spin_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static inline void arch_spin_unlock_irq(arch_spinlock_t *lock)
+{
+ arch_spin_unlock(lock);
+ local_irq_enable();
+}
+#endif
+
#ifdef CONFIG_DEBUG_SPINLOCK
extern void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
#define do_raw_spin_lock_flags(lock, flags) do_raw_spin_lock(lock)
extern int do_raw_spin_trylock(raw_spinlock_t *lock);
extern void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock);
+static inline void do_raw_spin_unlock_flags(raw_spinlock_t *lock, unsigned long flags)
+ __releases(lock)
+{
+ do_raw_spin_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static inline void do_raw_spin_unlock_irq(raw_spinlock_t *lock)
+ __releases(lock)
+{
+ do_raw_spin_unlock(lock);
+ local_irq_enable();
+}
#else
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
@@ -157,6 +184,20 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
+
+static inline void do_raw_spin_unlock_flags(raw_spinlock_t *lock, unsigned long flags)
+ __releases(lock)
+{
+ arch_spin_unlock_flags(&lock->raw_lock, flags);
+ __release(lock);
+}
+
+static inline void do_raw_spin_unlock_irq(raw_spinlock_t *lock)
+ __releases(lock)
+{
+ arch_spin_unlock_irq(&lock->raw_lock);
+ __release(lock);
+}
#endif
/*
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 51df117..cf9bf3b 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -157,16 +157,14 @@ static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
unsigned long flags)
{
spin_release(&lock->dep_map, 1, _RET_IP_);
- do_raw_spin_unlock(lock);
- local_irq_restore(flags);
+ do_raw_spin_unlock_flags(lock, flags);
preempt_enable();
}
static inline void __raw_spin_unlock_irq(raw_spinlock_t *lock)
{
spin_release(&lock->dep_map, 1, _RET_IP_);
- do_raw_spin_unlock(lock);
- local_irq_enable();
+ do_raw_spin_unlock_irq(lock);
preempt_enable();
}
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add adaptation support for ticket spinlocks. Each spinlock keeps a skip count
on how often to skip elision. This is controlled by the abort rate.
The actual adaptation algorithm is generic and shared with other lock types.
This avoids us having to tune each spinlock individually for elision.
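For illustration, the skip-count idea works roughly like the sketch
below. This is not the actual elide_lock_adapt() implementation; the
helper names are made up, and the real code also adapts the skip value
based on the observed abort rate and the per-lock-type tunables.

	/* Sketch: each lock carries a small skip counter. */
	static inline bool may_elide(short *adapt)
	{
		if (*adapt > 0) {	/* still inside the skip window */
			(*adapt)--;
			return false;	/* skip elision, lock normally */
		}
		return true;		/* try a transaction */
	}

	static inline void note_bad_abort(short *adapt, short skip_count)
	{
		*adapt = skip_count;	/* e.g. one of the *_skip tunables */
	}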
This adds 2 bytes to the 32bit spinlock used on systems configured for more
than 256 CPUs, so it's now 6 bytes instead of 4. For systems with fewer than
256 CPUs it expands spinlock_t from 2 to 4 bytes.
I played around with various schemes to merge the elision count with the main
tickets, but they all reduced the maximum CPU count support too much and caused
other problems. So in the end I just added the additional 2 bytes.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/spinlock_types.h | 3 +++
arch/x86/kernel/rtm-locks.c | 10 ++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..614159c 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -24,6 +24,9 @@ typedef struct arch_spinlock {
__ticket_t head, tail;
} tickets;
};
+#ifdef CONFIG_RTM_LOCKS
+ short elision_adapt;
+#endif
} arch_spinlock_t;
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index a313a81..1651049 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -76,16 +76,22 @@ static DEFINE_PER_CPU(bool, cli_elided);
static struct static_key spinlock_elision = STATIC_KEY_INIT_TRUE;
module_param(spinlock_elision, static_key, 0644);
+static __read_mostly struct elision_config spinlock_elision_config =
+ DEFAULT_ELISION_CONFIG;
+TUNE_ELISION_CONFIG(spinlock, spinlock_elision_config);
+
static int rtm_spin_trylock(struct arch_spinlock *lock)
{
- if (elide_lock(spinlock_elision, !__ticket_spin_is_locked(lock)))
+ if (elide_lock_adapt(spinlock_elision, !__ticket_spin_is_locked(lock),
+ &lock->elision_adapt, &spinlock_elision_config))
return 1;
return __ticket_spin_trylock(lock);
}
static inline void rtm_spin_lock(struct arch_spinlock *lock)
{
- if (!elide_lock(spinlock_elision, !__ticket_spin_is_locked(lock)))
+ if (!elide_lock_adapt(spinlock_elision, !__ticket_spin_is_locked(lock),
+ &lock->elision_adapt, &spinlock_elision_config))
__ticket_spin_lock(lock);
}
--
1.7.7.6
From: Andi Kleen <[email protected]>
Convert the rwsem elision to be adaptive. This requires adding
4/8 bytes to the rwsem for the adaptation state. The algorithm
is the same as for other adaptive lock types. The elision
configuration is split for readers and writers.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/rwsem.h | 22 +++++++++++++++++-----
arch/x86/kernel/rtm-locks.c | 10 ++++++++++
include/linux/rwsem.h | 3 +++
3 files changed, 30 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index f9297af..da6437c 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -39,6 +39,7 @@
#ifdef __KERNEL__
#include <asm/asm.h>
#include <linux/elide.h>
+#include <linux/jump_label.h>
/*
* The bias values and the counter type limits the number of
@@ -58,14 +59,18 @@
#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-extern bool rwsem_elision;
+extern struct static_key rwsem_elision;
+extern struct elision_config readsem_elision_config;
+extern struct elision_config writesem_elision_config;
/*
* lock for reading
*/
static inline void __down_read(struct rw_semaphore *sem)
{
- if (elide_lock(rwsem_elision, sem->count == 0))
+ if (elide_lock_adapt(rwsem_elision, sem->count == 0,
+ &sem->elision_adapt,
+ &readsem_elision_config))
return;
asm volatile("# beginning down_read\n\t"
LOCK_PREFIX _ASM_INC "(%1)\n\t"
@@ -86,7 +91,9 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
{
long result, tmp;
- if (elide_lock(rwsem_elision, sem->count == 0))
+ if (elide_lock_adapt(rwsem_elision, sem->count == 0,
+ &sem->elision_adapt,
+ &readsem_elision_config))
return 1;
asm volatile("# beginning __down_read_trylock\n\t"
" mov %0,%1\n\t"
@@ -111,7 +118,9 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
{
long tmp;
- if (elide_lock(rwsem_elision, sem->count == 0))
+ if (elide_lock_adapt(rwsem_elision, sem->count == 0,
+ &sem->elision_adapt,
+			     &writesem_elision_config))
return;
asm volatile("# beginning down_write\n\t"
LOCK_PREFIX " xadd %1,(%2)\n\t"
@@ -139,7 +148,9 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
{
long ret;
- if (elide_lock(rwsem_elision, sem->count == 0))
+ if (elide_lock_adapt(rwsem_elision, sem->count == 0,
+ &sem->elision_adapt,
+ &writesem_elision_config))
return 1;
ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
RWSEM_ACTIVE_WRITE_BIAS);
@@ -195,6 +206,7 @@ static inline void __up_write(struct rw_semaphore *sem)
*/
static inline void __downgrade_write(struct rw_semaphore *sem)
{
+ /* if (sem->count == 0) return; */
asm volatile("# beginning __downgrade_write\n\t"
LOCK_PREFIX _ASM_ADD "%2,(%1)\n\t"
/*
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index 40a0e7d..5a33a12 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -436,6 +436,15 @@ static unsigned rtm_patch(u8 type, u16 clobbers, void *ibuf,
struct static_key mutex_elision = STATIC_KEY_INIT_FALSE;
module_param(mutex_elision, static_key, 0644);
+struct static_key rwsem_elision = STATIC_KEY_INIT_FALSE;
+module_param(rwsem_elision, static_key, 0644);
+__read_mostly struct elision_config readsem_elision_config =
+ DEFAULT_ELISION_CONFIG;
+TUNE_ELISION_CONFIG(readsem, readsem_elision_config);
+__read_mostly struct elision_config writesem_elision_config =
+ DEFAULT_ELISION_CONFIG;
+TUNE_ELISION_CONFIG(writesem, writesem_elision_config);
+
void init_rtm_spinlocks(void)
{
if (!boot_cpu_has(X86_FEATURE_RTM))
@@ -464,6 +473,7 @@ void init_rtm_spinlocks(void)
static_key_slow_inc(&rwlock_elision);
static_key_slow_inc(&mutex_elision);
+ static_key_slow_inc(&rwsem_elision);
bitlock_elision = true;
}
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 436e430..fb96c565 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -28,6 +28,9 @@ struct rw_semaphore {
long count;
raw_spinlock_t wait_lock;
struct list_head wait_list;
+#ifdef CONFIG_RTM_LOCKS
+ short elision_adapt;
+#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
--
1.7.7.6
From: Andi Kleen <[email protected]>
Document the tunables and the statistics in Documentation/lock-elision.txt
Signed-off-by: Andi Kleen <[email protected]>
---
Documentation/lock-elision.txt | 94 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 Documentation/lock-elision.txt
diff --git a/Documentation/lock-elision.txt b/Documentation/lock-elision.txt
new file mode 100644
index 0000000..bc5a5ac
--- /dev/null
+++ b/Documentation/lock-elision.txt
@@ -0,0 +1,94 @@
+
+The Linux kernel uses Intel TSX lock elision for most of its locking primitives,
+when the hardware supports TSX.
+
+This allows lock regions that meet specific conditions to run
+in parallel, without explicit blocking, inside a memory transaction.
+If the transaction aborts, the code falls back to the normal lock.
+
+The lock elision is implemented using RTM.
+
+To measure transaction success the TSX perf extensions should be used.
+(At the time of this writing this requires unmerged perf patches,
+as in the full Haswell PMU patch series in hsw/pmu* in
+git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git)
+
+perf stat -T ...
+perf record -e tx-aborts ...
+
+The elision code is mostly adaptive, that is, each lock predicts whether
+it should elide or not based on its recent abort history.
+
+There are various tunables available in /sys/module/rtm_locks/parameters.
+The tunables are grouped by lock types.
+
+Elision can be enabled per lock type:
+
+rwsem_elision Enable elision for read-write semaphores
+mutex_elision Enable elision for mutexes
+spinlock_elision Enable elision for spinlocks
+rwlock_elision Enable elision for read-write spinlocks
+bitlock_elision Enable elision for bit spinlocks
+
+Some lock types use an adaptive algorithm. The adaptive algorithm
+has a number of tunables. Each tunable
+is named LOCKTYPE_TUNABLE. For example, to tune the conflict retries
+for mutexes do
+
+echo NUMBER > /sys/module/rtm_locks/parameters/mutex_conflict_retry
+
+Valid adaptive lock types are:
+mutex, readlock, writelock, readsem, writesem, spinlock
+
+The current tunables are (subject to change):
+
+In general increasing the skip counts makes lock elision more conservative,
+while increasing retries or timeout makes it more aggressive.
+
+*_elision Global elision enable for the lock type.
+*_conflict_abort_skip	Number of elisions to skip when a transaction
+			repeatedly failed due to memory conflicts despite retries.
+*_conflict_retry Number of retries on memory conflicts
+*_internal_abort_skip	Number of elisions to skip when a transaction
+ failed due to a reason inside the transaction
+ (that is not caused by another CPU)
+*_lock_busy_retry	How often to retry when the lock is busy on
+			elision.
+*_lock_busy_skip How often to skip elision when retries on
+ lock busy failed.
+*_other_abort_skip How often to skip elision when the transaction
+ failed for other reasons (e.g. capacity overflow)
+*_retry_timeout		How often to spin while waiting for a lock to become
+			free before retrying.
+
+Note these tunables do not present an ABI and may change as the internals
+evolve.
+
+Additional statistics:
+
+lock_el_skip Number of times a lock didn't elide due to skipping
+lock_el_skip_start Number of times a lock started skipping
+
+The average skip length is lock_el_skip / lock_el_skip_start.
+Additional statistics can be obtained using the perf TSX PMU events.
+
+There is also an elision:elision_skip_start trace point that triggers every
+time a lock starts skipping elision due to lack of success.
+
+References:
+"Adding Lock elision to Linux"
+http://halobates.de/adding-lock-elision-to-linux.pdf
+
+"Adding lock elision to the GNU C Library"
+http://lwn.net/Articles/533894/
+
+Full TSX specification:
+http://software.intel.com/file/41417 (chapter 8)
+
+Glossary:
+cache line Aligned 64 byte region
+conflict	Another CPU writes to a cache line read in an elided lock region,
+		or reads from a cache line written to in the region.
+abort		Rolling back the transaction and undoing its side effects
+
+Andi Kleen
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add simple implementations of arch_spin_unlock_flags/irq to the Xen
paravirt spinlock code.
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/xen/spinlock.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index f7a080e..4e22520 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -353,6 +353,21 @@ static void xen_spin_unlock(struct arch_spinlock *lock)
xen_spin_unlock_slow(xl);
}
+/* inline the Xen functions for irqs? */
+
+static void xen_spin_unlock_flags(struct arch_spinlock *lock,
+ unsigned long flags)
+{
+ xen_spin_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static void xen_spin_unlock_irq(struct arch_spinlock *lock)
+{
+ xen_spin_unlock(lock);
+ local_irq_enable();
+}
+
static irqreturn_t dummy_handler(int irq, void *dev_id)
{
BUG();
@@ -395,6 +410,8 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.spin_lock_flags = xen_spin_lock_flags;
pv_lock_ops.spin_trylock = xen_spin_trylock;
pv_lock_ops.spin_unlock = xen_spin_unlock;
+ pv_lock_ops.spin_unlock_flags = xen_spin_unlock_flags;
+ pv_lock_ops.spin_unlock_irq = xen_spin_unlock_irq;
}
#ifdef CONFIG_XEN_DEBUG_FS
--
1.7.7.6
From: Andi Kleen <[email protected]>
Use the generic elision macros and the mutex hook infrastructure
added earlier to attempt to elide mutexes, following the usual
elision wrapping pattern.
Lock elision does not allow modifying the lock itself, so it is not
possible to set the lock owner; we skip that when the lock is elided.
The lock owner is only used for debugging and adaptive spinning, and
neither applies with elision. All the checking is still done
when the lock falls back to real locking.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mutex.h | 51 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/rtm-locks.c | 3 ++
2 files changed, 54 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..7f585e3 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -1,5 +1,56 @@
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H 1
+
#ifdef CONFIG_X86_32
# include <asm/mutex_32.h>
#else
# include <asm/mutex_64.h>
#endif
+
+#define ARCH_HAS_MUTEX_AND_OWN 1
+
+#include <linux/elide.h>
+
+extern bool mutex_elision;
+
+/*
+ * Try speculation first and only do the normal locking and owner setting
+ * if that fails.
+ */
+
+#define mutex_free(l) (atomic_read(&(l)->count) == 1)
+
+#define __mutex_fastpath_lock_and_own(l, s) ({ \
+ if (!elide_lock(mutex_elision, \
+ mutex_free(l))) { \
+ __mutex_fastpath_lock(&(l)->count, s); \
+ mutex_set_owner(l); \
+ } \
+ })
+
+#define __mutex_fastpath_unlock_and_unown(l, s) ({ \
+ if (!elide_unlock(mutex_free(l))) { \
+ mutex_unlock_clear_owner(l); \
+ __mutex_fastpath_unlock(&(l)->count, s); \
+ } \
+ })
+
+#define __mutex_fastpath_lock_retval_and_own(l, s) ({ \
+ int ret = 0; \
+ if (!elide_lock(mutex_elision, mutex_free(l))) { \
+ ret = __mutex_fastpath_lock_retval(&(l)->count, s); \
+ if (!ret) \
+ mutex_set_owner(l); \
+ } \
+ ret; })
+
+#define __mutex_fastpath_trylock_and_own(l, s) ({ \
+ int ret = 1; \
+ if (!elide_lock(mutex_elision, mutex_free(l))) {\
+ ret = __mutex_fastpath_trylock(&(l)->count, s); \
+ if (ret) \
+ mutex_set_owner(l); \
+ } \
+ ret; })
+
+#endif
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index 0717050..f3ae8e6 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -348,3 +348,6 @@ void init_rtm_spinlocks(void)
pv_irq_ops.restore_fl = PV_CALLEE_SAVE(rtm_restore_fl);
pv_init_ops.patch = rtm_patch;
}
+
+__read_mostly bool mutex_elision = true;
+module_param(mutex_elision, bool, 0644);
--
1.7.7.6
From: Andi Kleen <[email protected]>
raw_spin_is_locked() aborts the transaction under lock elision. Some code does
a lot of lock asserts, which causes a lot of aborts. Add a _xtest() here so
that the check is only done when the lock is not elided. The check still runs
occasionally due to fallbacks, so there is still enough assert coverage.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/spinlock_api_smp.h | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index cf9bf3b..cd9269b 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -5,6 +5,8 @@
# error "please don't include this file directly"
#endif
+#include <linux/rtm.h>
+
/*
* include/linux/spinlock_api_smp.h
*
@@ -17,7 +19,10 @@
int in_lock_functions(unsigned long addr);
-#define assert_raw_spin_locked(x) BUG_ON(!raw_spin_is_locked(x))
+#define assert_raw_spin_locked(x) do { \
+ if (!_xtest()) \
+ BUG_ON(!raw_spin_is_locked(x)); \
+ } while (0)
void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add the standard elide wrapper macros to rwsems to enable
lock elision for them. The main target is mmap_sem.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/rwsem.h | 23 ++++++++++++++++++++++-
arch/x86/kernel/rtm-locks.c | 3 +++
include/linux/rwsem.h | 3 +++
3 files changed, 28 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 2dbe4a7..f9297af 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -38,6 +38,7 @@
#ifdef __KERNEL__
#include <asm/asm.h>
+#include <linux/elide.h>
/*
* The bias values and the counter type limits the number of
@@ -57,11 +58,15 @@
#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+extern bool rwsem_elision;
+
/*
* lock for reading
*/
static inline void __down_read(struct rw_semaphore *sem)
{
+ if (elide_lock(rwsem_elision, sem->count == 0))
+ return;
asm volatile("# beginning down_read\n\t"
LOCK_PREFIX _ASM_INC "(%1)\n\t"
/* adds 0x00000001 */
@@ -80,6 +85,9 @@ static inline void __down_read(struct rw_semaphore *sem)
static inline int __down_read_trylock(struct rw_semaphore *sem)
{
long result, tmp;
+
+ if (elide_lock(rwsem_elision, sem->count == 0))
+ return 1;
asm volatile("# beginning __down_read_trylock\n\t"
" mov %0,%1\n\t"
"1:\n\t"
@@ -102,6 +110,9 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
{
long tmp;
+
+ if (elide_lock(rwsem_elision, sem->count == 0))
+ return;
asm volatile("# beginning down_write\n\t"
LOCK_PREFIX " xadd %1,(%2)\n\t"
/* adds 0xffff0001, returns the old value */
@@ -126,7 +137,11 @@ static inline void __down_write(struct rw_semaphore *sem)
*/
static inline int __down_write_trylock(struct rw_semaphore *sem)
{
- long ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ long ret;
+
+ if (elide_lock(rwsem_elision, sem->count == 0))
+ return 1;
+ ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
RWSEM_ACTIVE_WRITE_BIAS);
if (ret == RWSEM_UNLOCKED_VALUE)
return 1;
@@ -139,6 +154,9 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
static inline void __up_read(struct rw_semaphore *sem)
{
long tmp;
+
+ if (elide_unlock(sem->count == 0))
+ return;
asm volatile("# beginning __up_read\n\t"
LOCK_PREFIX " xadd %1,(%2)\n\t"
/* subtracts 1, returns the old value */
@@ -157,6 +175,9 @@ static inline void __up_read(struct rw_semaphore *sem)
static inline void __up_write(struct rw_semaphore *sem)
{
long tmp;
+
+ if (elide_unlock(sem->count == 0))
+ return;
asm volatile("# beginning __up_write\n\t"
LOCK_PREFIX " xadd %1,(%2)\n\t"
/* subtracts 0xffff0001, returns the old value */
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index f3ae8e6..1811028 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -351,3 +351,6 @@ void init_rtm_spinlocks(void)
__read_mostly bool mutex_elision = true;
module_param(mutex_elision, bool, 0644);
+
+__read_mostly bool rwsem_elision = true;
+module_param(rwsem_elision, bool, 0644);
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 8da67d6..436e430 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -16,6 +16,8 @@
#include <linux/atomic.h>
+#include <linux/elide.h>
+
struct rw_semaphore;
#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
@@ -42,6 +44,7 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
/* In all implementations count != 0 means locked */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
{
+ elide_abort();
return sem->count != 0;
}
--
1.7.7.6
From: Andi Kleen <[email protected]>
Inside an elided mutex we cannot tell whether the mutex is really locked
or not. Aborting is the safe answer.
Callers that abort frequently (e.g. BUG_ONs) need to be fixed
separately.
This is a noop without RTM.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/mutex.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..0574095 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -14,6 +14,7 @@
#include <linux/spinlock_types.h>
#include <linux/linkage.h>
#include <linux/lockdep.h>
+#include <linux/elide.h>
#include <linux/atomic.h>
@@ -123,6 +124,7 @@ extern void __mutex_init(struct mutex *lock, const char *name,
*/
static inline int mutex_is_locked(struct mutex *lock)
{
+ elide_abort();
return atomic_read(&lock->count) != 1;
}
--
1.7.7.6
From: Andi Kleen <[email protected]>
Elided locks must not write to the lock cache line in the fast
path, as that would abort the transaction. They also do not need
an owner in the speculative fast path, because they do not actually
take the lock. But in the slow path, when the lock is really taken,
the owner should still be set so that adaptive spinning
works correctly.
Right now the mutex code only allows the architecture to either
always set the owner or never set it. Add new fast path wrappers
that combine owner setting with the fast path call and that can
be overridden by the architecture.
This lets the RTM code write the owner only when not eliding.
This initial patch just moves the code out into
new macros. A noop in itself.
Signed-off-by: Andi Kleen <[email protected]>
---
kernel/mutex.c | 79 ++++++++++++++++++++++++++++++++++----------------------
1 files changed, 48 insertions(+), 31 deletions(-)
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..982b136 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -26,6 +26,17 @@
#include <linux/debug_locks.h>
/*
+ * When debugging is enabled we must not clear the owner before time,
+ * the slow path will always be taken, and that clears the owner field
+ * after verifying that it was indeed current.
+ */
+#ifndef CONFIG_DEBUG_MUTEXES
+#define mutex_unlock_clear_owner(l) mutex_clear_owner(l)
+#else
+#define mutex_unlock_clear_owner(l) do {} while(0)
+#endif
+
+/*
* In the DEBUG case we are using the "NULL fastpath" for mutexes,
* which forces all calls into the slowpath:
*/
@@ -37,6 +48,36 @@
# include <asm/mutex.h>
#endif
+/*
+ * Eliding locks cannot support an owner, so allow the architecture
+ * to disable owner for this case only.
+ *
+ * Below are the default implementations to be used when the architecture
+ * doesn't do anything special.
+ *
+ * Note these cannot be inline functions because "s" is a label passed to the assembler.
+ */
+#ifndef ARCH_HAS_MUTEX_AND_OWN
+#define __mutex_fastpath_lock_and_own(l, s) ({ \
+ __mutex_fastpath_lock(&(l)->count, s); \
+ mutex_set_owner(l); })
+#define __mutex_fastpath_unlock_and_unown(l, s) ({ \
+ mutex_unlock_clear_owner(l); \
+ __mutex_fastpath_unlock(&(l)->count, s); })
+#define __mutex_fastpath_lock_retval_and_own(l, s) ({ \
+ int ret; \
+ ret = __mutex_fastpath_lock_retval(&(l)->count, s); \
+ if (!ret) \
+ mutex_set_owner(l); \
+ ret; })
+#define __mutex_fastpath_trylock_and_own(l, s) ({ \
+ int ret; \
+ ret = __mutex_fastpath_trylock(&(l)->count, s); \
+ if (ret) \
+ mutex_set_owner(l); \
+ ret; })
+#endif
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -88,8 +129,7 @@ void __sched mutex_lock(struct mutex *lock)
* The locking fastpath is the 1->0 transition from
* 'unlocked' into 'locked' state.
*/
- __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
- mutex_set_owner(lock);
+ __mutex_fastpath_lock_and_own(lock, __mutex_lock_slowpath);
}
EXPORT_SYMBOL(mutex_lock);
@@ -114,15 +154,7 @@ void __sched mutex_unlock(struct mutex *lock)
* The unlocking fastpath is the 0->1 transition from 'locked'
* into 'unlocked' state:
*/
-#ifndef CONFIG_DEBUG_MUTEXES
- /*
- * When debugging is enabled we must not clear the owner before time,
- * the slow path will always be taken, and that clears the owner field
- * after verifying that it was indeed current.
- */
- mutex_clear_owner(lock);
-#endif
- __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
+ __mutex_fastpath_unlock_and_unown(lock, __mutex_unlock_slowpath);
}
EXPORT_SYMBOL(mutex_unlock);
@@ -372,11 +404,8 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
int ret;
might_sleep();
- ret = __mutex_fastpath_lock_retval
- (&lock->count, __mutex_lock_interruptible_slowpath);
- if (!ret)
- mutex_set_owner(lock);
-
+	ret = __mutex_fastpath_lock_retval_and_own(lock,
+			__mutex_lock_interruptible_slowpath);
return ret;
}
@@ -384,15 +413,9 @@ EXPORT_SYMBOL(mutex_lock_interruptible);
int __sched mutex_lock_killable(struct mutex *lock)
{
- int ret;
-
might_sleep();
- ret = __mutex_fastpath_lock_retval
- (&lock->count, __mutex_lock_killable_slowpath);
- if (!ret)
- mutex_set_owner(lock);
-
- return ret;
+ return __mutex_fastpath_lock_retval_and_own(lock,
+ __mutex_lock_killable_slowpath);
}
EXPORT_SYMBOL(mutex_lock_killable);
@@ -464,13 +487,7 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count)
*/
int __sched mutex_trylock(struct mutex *lock)
{
- int ret;
-
- ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
- if (ret)
- mutex_set_owner(lock);
-
- return ret;
+ return __mutex_fastpath_trylock_and_own(lock, __mutex_trylock_slowpath);
}
EXPORT_SYMBOL(mutex_trylock);
--
1.7.7.6
From: Andi Kleen <[email protected]>
This is mainly useful for simple statistics counters: reading sums
the counter over all CPUs, and any write clears it.
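A usage sketch (hypothetical example, not part of this patch):

	/* Per-CPU statistics counter: cheap to increment locally,
	 * summed over all CPUs on read, cleared by any write.
	 */
	static DEFINE_PER_CPU(unsigned int, lock_el_skip);
	module_param(lock_el_skip, percpu_uint, 0644);

	static inline void count_skip(void)
	{
		this_cpu_inc(lock_el_skip);
	}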
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/moduleparam.h | 4 ++++
kernel/params.c | 28 ++++++++++++++++++++++++++++
2 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index 137b419..e3c41af 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -394,6 +394,10 @@ extern int param_set_bint(const char *val, const struct kernel_param *kp);
#define param_get_bint param_get_int
#define param_check_bint param_check_int
+/* A per cpu read-only unsigned integer. Useful for counters. */
+extern struct kernel_param_ops param_ops_percpu_uint;
+#define param_check_percpu_uint(name, p) param_check_uint(name, p)
+
/**
* module_param_array - a parameter which is an array of some type
* @name: the name of the array variable
diff --git a/kernel/params.c b/kernel/params.c
index ed35345..1a8a1ca 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -499,6 +499,34 @@ struct kernel_param_ops param_ops_string = {
};
EXPORT_SYMBOL(param_ops_string);
+/* Per CPU read only module param. */
+
+static int param_percpu_uint_set(const char *val, const struct kernel_param *kp)
+{
+ int cpu;
+
+ /* just clear on any write */
+ for_each_possible_cpu(cpu)
+ *per_cpu_ptr((unsigned * __percpu)(kp->arg), cpu) = 0;
+ return 0;
+}
+
+static int param_percpu_uint_get(char *buffer, const struct kernel_param *kp)
+{
+ int cpu;
+ unsigned count = 0;
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr((unsigned * __percpu)(kp->arg), cpu);
+ return sprintf(buffer, "%u", count);
+}
+
+struct kernel_param_ops param_ops_percpu_uint = {
+ .set = param_percpu_uint_set,
+ .get = param_percpu_uint_get,
+};
+EXPORT_SYMBOL(param_ops_percpu_uint);
+
/* sysfs output in /sys/modules/XYZ/parameters/ */
#define to_module_attr(n) container_of(n, struct module_attribute, attr)
#define to_module_kobject(n) container_of(n, struct module_kobject, kobj)
--
1.7.7.6
From: Andi Kleen <[email protected]>
The TSX RTM lock elision code needs to distinguish unlocks that reenable
interrupts from others. Add arch_read/write_unlock_irq/irqrestore for rwlocks,
similar to the ones for spinlocks. This is opt-in by the architecture.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rwlock.h | 76 ++++++++++++++++++++++++++++++++++++++-
include/linux/rwlock_api_smp.h | 12 ++----
2 files changed, 78 insertions(+), 10 deletions(-)
diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index bc2994e..82a7f61 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -14,6 +14,34 @@
* Released under the General Public License (GPL).
*/
+#if !defined(ARCH_HAS_RWLOCK_UNLOCK_IRQ) && defined(CONFIG_SMP)
+static inline void arch_read_unlock_irqrestore(arch_rwlock_t *lock,
+ unsigned long flags)
+{
+ arch_read_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static inline void arch_read_unlock_irq(arch_rwlock_t *lock)
+{
+ arch_read_unlock(lock);
+ local_irq_enable();
+}
+
+static inline void arch_write_unlock_irqrestore(arch_rwlock_t *lock,
+ unsigned long flags)
+{
+ arch_write_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static inline void arch_write_unlock_irq(arch_rwlock_t *lock)
+{
+ arch_write_unlock(lock);
+ local_irq_enable();
+}
+#endif
+
#ifdef CONFIG_DEBUG_SPINLOCK
extern void __rwlock_init(rwlock_t *lock, const char *name,
struct lock_class_key *key);
@@ -37,17 +65,61 @@ do { \
#define do_raw_write_lock_flags(lock, flags) do_raw_write_lock(lock)
extern int do_raw_write_trylock(rwlock_t *lock);
extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock);
+
+static inline void do_raw_write_unlock_irq(rwlock_t *lock) __releases(lock)
+{
+ do_raw_write_unlock(lock);
+ local_irq_enable();
+}
+
+static inline void do_raw_read_unlock_irq(rwlock_t *lock) __releases(lock)
+{
+ do_raw_read_unlock(lock);
+ local_irq_enable();
+}
+
+static inline void do_raw_write_unlock_irqrestore(rwlock_t *lock,
+ unsigned long flags)
+ __releases(lock)
+{
+ do_raw_write_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static inline void do_raw_read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
+ __releases(lock)
+{
+ do_raw_read_unlock(lock);
+ local_irq_restore(flags);
+}
+
#else
# define do_raw_read_lock(rwlock) do {__acquire(lock); arch_read_lock(&(rwlock)->raw_lock); } while (0)
-# define do_raw_read_lock_flags(lock, flags) \
- do {__acquire(lock); arch_read_lock_flags(&(lock)->raw_lock, *(flags)); } while (0)
+# define do_raw_read_lock_flags(lock, flags) \
+ do { __acquire(lock); \
+ arch_read_lock_flags(&(lock)->raw_lock, *(flags)); } while (0)
# define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock)
# define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
+# define do_raw_read_unlock_irqrestore(rwlock, flags) \
+ do { \
+ arch_read_unlock_irqrestore(&(rwlock)->raw_lock, flags); \
+ __release(lock); \
+ } while (0)
+# define do_raw_read_unlock_irq(rwlock) \
+ do {arch_read_unlock_irq(&(rwlock)->raw_lock); __release(lock); } while (0)
# define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0)
# define do_raw_write_lock_flags(lock, flags) \
do {__acquire(lock); arch_write_lock_flags(&(lock)->raw_lock, *(flags)); } while (0)
# define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock)
# define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
+# define do_raw_write_unlock_irqrestore(rwlock, flags) \
+ do { \
+ arch_write_unlock_irqrestore(&(rwlock)->raw_lock, flags); \
+ __release(lock); \
+ } while (0)
+# define do_raw_write_unlock_irq(rwlock) \
+ do { arch_write_unlock_irq(&(rwlock)->raw_lock); \
+ __release(lock); } while (0)
#endif
#define read_can_lock(rwlock) arch_read_can_lock(&(rwlock)->raw_lock)
diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
index 9c9f049..8ac4a73 100644
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -233,16 +233,14 @@ static inline void
__raw_read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
{
rwlock_release(&lock->dep_map, 1, _RET_IP_);
- do_raw_read_unlock(lock);
- local_irq_restore(flags);
+ do_raw_read_unlock_irqrestore(lock, flags);
preempt_enable();
}
static inline void __raw_read_unlock_irq(rwlock_t *lock)
{
rwlock_release(&lock->dep_map, 1, _RET_IP_);
- do_raw_read_unlock(lock);
- local_irq_enable();
+ do_raw_read_unlock_irq(lock);
preempt_enable();
}
@@ -258,16 +256,14 @@ static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
unsigned long flags)
{
rwlock_release(&lock->dep_map, 1, _RET_IP_);
- do_raw_write_unlock(lock);
- local_irq_restore(flags);
+ do_raw_write_unlock_irqrestore(lock, flags);
preempt_enable();
}
static inline void __raw_write_unlock_irq(rwlock_t *lock)
{
rwlock_release(&lock->dep_map, 1, _RET_IP_);
- do_raw_write_unlock(lock);
- local_irq_enable();
+ do_raw_write_unlock_irq(lock);
preempt_enable();
}
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add a simple boot time RTM self test that runs one transaction and logs
whether it committed or aborted.
May be removed later, but useful for basic sanity checking.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/rtm-test.c | 22 ++++++++++++++++++++++
2 files changed, 24 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/rtm-test.c
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7bd3bd3..f46aebd 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -114,3 +114,5 @@ ifeq ($(CONFIG_X86_64),y)
obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o
obj-y += vsmp_64.o
endif
+
+obj-y += rtm-test.o
diff --git a/arch/x86/kernel/rtm-test.c b/arch/x86/kernel/rtm-test.c
new file mode 100644
index 0000000..5e138c1
--- /dev/null
+++ b/arch/x86/kernel/rtm-test.c
@@ -0,0 +1,22 @@
+#include <asm/rtm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+
+int x;
+
+static __init int rtm_test(void)
+{
+ unsigned status;
+
+ pr_info("simple rtm test\n");
+ if ((status = _xbegin()) == _XBEGIN_STARTED) {
+ x++;
+ _xend();
+ pr_info("transaction committed\n");
+ } else {
+ pr_info("transaction aborted %x\n", status);
+ }
+ return 0;
+}
+
+device_initcall(rtm_test);
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add support for arch_spin_unlock_flags/irq.
This is only done for the paravirt case; non-paravirt just uses the wrappers
in the upper level code.
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/paravirt.h | 13 +++++++++++++
arch/x86/include/asm/paravirt_types.h | 3 +++
arch/x86/kernel/paravirt-spinlocks.c | 16 ++++++++++++++++
3 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..3d5a0b1 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -745,6 +745,19 @@ static __always_inline void arch_spin_unlock(struct arch_spinlock *lock)
PVOP_VCALL1(pv_lock_ops.spin_unlock, lock);
}
+static __always_inline void arch_spin_unlock_flags(struct arch_spinlock *lock,
+ unsigned long flags)
+{
+ PVOP_VCALL2(pv_lock_ops.spin_unlock_flags, lock, flags);
+}
+
+static __always_inline void arch_spin_unlock_irq(struct arch_spinlock *lock)
+{
+ PVOP_VCALL1(pv_lock_ops.spin_unlock_irq, lock);
+}
+
+#define ARCH_HAS_SPIN_UNLOCK_IRQ 1
+
#endif
#ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..b428393 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,6 +333,9 @@ struct pv_lock_ops {
void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long flags);
int (*spin_trylock)(struct arch_spinlock *lock);
void (*spin_unlock)(struct arch_spinlock *lock);
+ void (*spin_unlock_flags)(struct arch_spinlock *lock,
+ unsigned long flags);
+ void (*spin_unlock_irq)(struct arch_spinlock *lock);
};
/* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 676b8c7..c41fc8c 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -13,6 +13,20 @@ default_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
arch_spin_lock(lock);
}
+static void
+default_spin_unlock_flags(arch_spinlock_t *lock, unsigned long flags)
+{
+ arch_spin_unlock(lock);
+ local_irq_restore(flags);
+}
+
+static inline void
+default_spin_unlock_irq(arch_spinlock_t *lock)
+{
+ arch_spin_unlock(lock);
+ local_irq_enable();
+}
+
struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP
.spin_is_locked = __ticket_spin_is_locked,
@@ -22,6 +36,8 @@ struct pv_lock_ops pv_lock_ops = {
.spin_lock_flags = default_spin_lock_flags,
.spin_trylock = __ticket_spin_trylock,
.spin_unlock = __ticket_spin_unlock,
+ .spin_unlock_irq = default_spin_unlock_flags,
+ .spin_unlock_flags = default_spin_unlock_irq,
#endif
};
EXPORT_SYMBOL(pv_lock_ops);
--
1.7.7.6
From: Andi Kleen <[email protected]>
For lock elision we (mostly) use generic elide() macros that
can be added to the lock code with minimal intrusion. Add a generic
version that does nothing and is used when RTM is not available.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/elide.h | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)
create mode 100644 include/linux/elide.h
diff --git a/include/linux/elide.h b/include/linux/elide.h
new file mode 100644
index 0000000..95b3ec2
--- /dev/null
+++ b/include/linux/elide.h
@@ -0,0 +1,18 @@
+#ifndef _LINUX_ELIDE_H
+#define _LINUX_ELIDE_H 1
+
+#include <linux/rtm.h>
+
+#ifdef CONFIG_RTM_LOCKS
+#include <asm/elide.h>
+#else
+#define elide_lock(l, f) 0
+#define elide_lock_adapt(f, l, a, ac) 0
+#define elide_unlock(l) 0
+#define elide_abort() do {} while (0)
+struct elision_config {};
+#define DEFAULT_ELISION_CONFIG {}
+#define TUNE_ELISION_CONFIG(a, b)
+#endif
+
+#endif
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add basic TSX lock elision infrastructure. This is implemented
using RTM to give more flexibility. A lock is elided by
wrapping an elision check around it: when the lock is free
try to speculatively execute the lock region and fall back
if that fails.
Provide some generic macros to add lock elision wrapping
to different lock types.
Patch into the spinlocks using paravirt ops. We also
have to intercept cli/sti to avoid aborts due to
changing the interrupt flag (see the comment in the source
for more details).
Since paravirt ops cannot be stacked, this currently implies
that either a pvops-using hypervisor or elision can be active,
but not both at the same time. This is likely fixable.
For read-write locks and other locks we have to use direct hooks
(added in follow-on patches).
All elision can be enabled/disabled through module params.
We also use the module params for tuning and exporting statistics.
While that is slightly unusual, it leads to very simple
and concise code.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/Kconfig | 15 ++
arch/x86/include/asm/elide.h | 58 ++++++
arch/x86/include/asm/rtm-locks.h | 18 ++
arch/x86/include/asm/setup.h | 6 +
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/paravirt-spinlocks.c | 4 +-
arch/x86/kernel/rtm-locks.c | 350 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/setup.c | 1 +
8 files changed, 453 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/elide.h
create mode 100644 arch/x86/include/asm/rtm-locks.h
create mode 100644 arch/x86/kernel/rtm-locks.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 70c0f3d..015db67 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -656,9 +656,24 @@ config PARAVIRT_SPINLOCKS
If you are unsure how to answer this question, answer N.
+config RTM_LOCKS
+ bool "RTM elided locks"
+ depends on PARAVIRT && SMP && !LOCKDEP
+ ---help---
+	  Use TSX-enabled locks.
+	  This allows eliding locks on systems that support Intel TSX.
+	  Eliding locks allows lock regions to run in parallel when various
+	  conditions are met. On systems without TSX this will do nothing.
+	  This uses the same mechanism as paravirtualized locks on Xen and
+	  other hypervisors, and only one of the two can be used.
+
config PARAVIRT_CLOCK
bool
+config ARCH_HAS_ELISION
+ def_bool y
+ depends on RTM_LOCKS
+
endif
config PARAVIRT_DEBUG
diff --git a/arch/x86/include/asm/elide.h b/arch/x86/include/asm/elide.h
new file mode 100644
index 0000000..c492aed
--- /dev/null
+++ b/arch/x86/include/asm/elide.h
@@ -0,0 +1,58 @@
+#ifndef _ASM_ELIDE_H
+#define _ASM_ELIDE_H 1
+
+#ifdef CONFIG_RTM_LOCKS
+#include <asm/rtm.h>
+
+/*
+ * These are out of line unfortunately, just to avoid
+ * a nasty include loop with per cpu data.
+ * (FIXME)
+ */
+extern int __elide_lock(void);
+extern void __elide_unlock(void);
+
+/*
+ * Simple lock elision wrappers for locks.
+ * f is the static key that enables/disables elision
+ * l must be evaluated by the macro later, and yield 1
+ * when the lock is free.
+ *
+ * TBD should use static_keys too, but that needs
+ * more changes to avoid include loop hell with users.
+ */
+
+#define elide_lock(f, l) ({ \
+ int flag = 0; \
+ if ((f) && __elide_lock()) { \
+ if (l) \
+ flag = 1; \
+ else \
+ _xabort(0xff); \
+ } \
+ flag; \
+})
+
+/*
+ * Note that if you see a general protection fault
+ * in the _xend you have a unmatched unlock. Please fix
+ * your code.
+ */
+
+#define elide_unlock(l) ({ \
+ int flag = 0; \
+ if (l) { \
+ __elide_unlock(); \
+ flag = 1; \
+ } \
+ flag; \
+})
+
+/*
+ * Use for code that cannot elide, primarily code that queries
+ * the lock state.
+ */
+#define elide_abort() _xabort(0xfe)
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/rtm-locks.h b/arch/x86/include/asm/rtm-locks.h
new file mode 100644
index 0000000..1a7ff2a
--- /dev/null
+++ b/arch/x86/include/asm/rtm-locks.h
@@ -0,0 +1,18 @@
+#ifndef _ASM_RTM_LOCKS
+#define _ASM_RTM_LOCKS 1
+
+#include <asm/rwlock.h>
+
+/* rwlocks */
+
+void rtm_read_lock(arch_rwlock_t *rw);
+void rtm_read_unlock(arch_rwlock_t *rw);
+void rtm_read_unlock_irq(arch_rwlock_t *rw);
+void rtm_read_unlock_irqrestore(arch_rwlock_t *rw, unsigned long flags);
+int rtm_read_trylock(arch_rwlock_t *rw);
+void rtm_write_lock(arch_rwlock_t *rw);
+void rtm_write_unlock(arch_rwlock_t *rw);
+void rtm_write_unlock_irq(arch_rwlock_t *rw);
+void rtm_write_unlock_irqrestore(arch_rwlock_t *rw, unsigned long flags);
+
+#endif
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..3fbfdaf 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -34,6 +34,12 @@ void vsmp_init(void);
static inline void vsmp_init(void) { }
#endif
+#ifdef CONFIG_RTM_LOCKS
+void init_rtm_spinlocks(void);
+#else
+static inline void init_rtm_spinlocks(void) { }
+#endif
+
void setup_bios_corruption_check(void);
#ifdef CONFIG_X86_VISWS
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f46aebd..995c788 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -14,6 +14,7 @@ CFLAGS_REMOVE_pvclock.o = -pg
CFLAGS_REMOVE_kvmclock.o = -pg
CFLAGS_REMOVE_ftrace.o = -pg
CFLAGS_REMOVE_early_printk.o = -pg
+CFLAGS_REMOVE_rtm-locks.o = -pg
endif
obj-y := process_$(BITS).o signal.o entry_$(BITS).o
@@ -103,6 +104,8 @@ obj-$(CONFIG_UPROBES) += uprobes.o
obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
+obj-$(CONFIG_RTM_LOCKS) += rtm-locks.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index c41fc8c..1451956 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -36,8 +36,8 @@ struct pv_lock_ops pv_lock_ops = {
.spin_lock_flags = default_spin_lock_flags,
.spin_trylock = __ticket_spin_trylock,
.spin_unlock = __ticket_spin_unlock,
- .spin_unlock_irq = default_spin_unlock_flags,
- .spin_unlock_flags = default_spin_unlock_irq,
+ .spin_unlock_irq = default_spin_unlock_irq,
+ .spin_unlock_flags = default_spin_unlock_flags,
#endif
};
EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
new file mode 100644
index 0000000..0717050
--- /dev/null
+++ b/arch/x86/kernel/rtm-locks.c
@@ -0,0 +1,350 @@
+/*
+ * Intel TSX RTM (Restricted Transactional Memory) lock elision.
+ * Lock elision allows lock regions to run in parallel using transactional memory.
+ *
+ * (C) Copyright 2012, 2013 Intel Corporation
+ * Author: Andi Kleen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * Adds a fast path for locks. Run each lock speculatively in a hardware
+ * memory transaction implemented by the CPU. When the transaction succeeds
+ * the lock will have executed in parallel without blocking.
+ *
+ * If the transaction aborts (due to memory conflicts or other causes)
+ * eventually fall back to normal locking.
+ *
+ * For spinlocks use paravirt ops to hook in the RTM lock elision. For
+ * interrupt disabling we also use the pvops to patch in our own code
+ * that avoids aborts. For other locks that are not supported by pvops
+ * use direct hooks.
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/elide.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/bit_spinlock.h>
+#include <linux/jump_label.h>
+#include <asm/rtm.h>
+#include <asm/paravirt.h>
+
+/*
+ * We need a software in_tx marker, to answer the question
+ * "Is this an inner nested transaction commit?" inside the transaction.
+ * XTEST unfortunately does not tell us that.
+ *
+ * This is needed to handle
+ *
+ * spin_lock(x)
+ * spin_lock_irqsave(y, flags)
+ * spin_unlock(y) // no _irqrestore
+ * spin_unlock(x)
+ * ... code that relies on interrupts disabled ...
+ * local_irq_restore(flags)
+ *
+ * If the outermost spin_lock has the irqsave there is no problem
+ * because we just disable/reenable interrupts outside the transaction.
+ * But we cannot do that for a nested spin lock, because disabling
+ * interrupts would abort. Normally we don't need to disable
+ * interrupts in a transaction anyways because any interrupt aborts.
+ * But there's no way to atomically disable the interrupts on
+ * unlock/commit and keep them disabled after the transaction.
+ *
+ * The current solution is to detect the non matched unlock and abort
+ * (and fix code which does that frequently). This needs the software
+ * in_tx counter.
+ */
+
+static DEFINE_PER_CPU(int, in_tx);
+static DEFINE_PER_CPU(bool, cli_elided);
+
+#define start_in_tx() __this_cpu_inc(in_tx)
+#define end_in_tx() __this_cpu_dec(in_tx)
+#define is_in_tx() __this_cpu_read(in_tx)
+
+static struct static_key spinlock_elision = STATIC_KEY_INIT_TRUE;
+module_param(spinlock_elision, static_key, 0644);
+
+static int rtm_spin_trylock(struct arch_spinlock *lock)
+{
+ if (elide_lock(spinlock_elision, !__ticket_spin_is_locked(lock)))
+ return 1;
+ return __ticket_spin_trylock(lock);
+}
+
+static inline void rtm_spin_lock(struct arch_spinlock *lock)
+{
+ if (!elide_lock(spinlock_elision, !__ticket_spin_is_locked(lock)))
+ __ticket_spin_lock(lock);
+}
+
+static void rtm_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
+{
+ rtm_spin_lock(lock);
+}
+
+static inline void
+rtm_spin_unlock_check(struct arch_spinlock *lock, bool not_enabling)
+{
+ /*
+ * Note when you get a #GP here this usually means that you
+ * unlocked a lock that was not locked. Please fix your code.
+ */
+ if (!__ticket_spin_is_locked(lock)) {
+ /*
+		 * Unlock without restoring interrupts, although interrupts
+		 * were disabled by a nested section inside the transaction.
+ * In this case we have to abort.
+ */
+ if (not_enabling && this_cpu_read(cli_elided) &&
+ this_cpu_read(in_tx) == 1)
+ _xabort(0xfc);
+ end_in_tx();
+ _xend();
+ } else
+ __ticket_spin_unlock(lock);
+}
+
+static void rtm_spin_unlock(struct arch_spinlock *lock)
+{
+ rtm_spin_unlock_check(lock, true);
+}
+
+static void rtm_spin_unlock_flags(struct arch_spinlock *lock,
+ unsigned long flags)
+{
+ rtm_spin_unlock_check(lock, !(flags & X86_EFLAGS_IF));
+ local_irq_restore(flags);
+}
+
+static void rtm_spin_unlock_irq(struct arch_spinlock *lock)
+{
+ rtm_spin_unlock_check(lock, false);
+ local_irq_enable();
+}
+
+static int rtm_spin_is_locked(struct arch_spinlock *lock)
+{
+ /*
+ * Cannot tell reliably if the lock is locked or not
+ * when we're in a transaction. So abort instead.
+ */
+ _xabort(0xfe);
+ return __ticket_spin_is_locked(lock);
+}
+
+/*
+ * rwlocks: both readers and writers freely speculate.
+ * This uses direct calls with static patching, not pvops.
+ */
+
+__read_mostly bool rwlock_elision = true;
+module_param(rwlock_elision, bool, 0644);
+
+void rtm_read_lock(arch_rwlock_t *rw)
+{
+ /*
+ * Abort when there is a writer.
+ * In principle we don't care about readers here,
+ * but since they are on the same cache line they
+ * would abort anyways.
+ */
+
+ if (!elide_lock(rwlock_elision, !arch_rwlock_is_locked(rw)))
+ arch_do_read_lock(rw);
+}
+EXPORT_SYMBOL(rtm_read_lock);
+
+static inline void rtm_read_unlock_check(arch_rwlock_t *rw, bool not_enabling)
+{
+ /*
+ * Note when you get a #GP here this usually means that you
+ * unlocked a lock that was not locked. Please fix your code.
+ */
+ if (!arch_rwlock_is_locked(rw)) {
+ if (not_enabling && this_cpu_read(cli_elided) &&
+ this_cpu_read(in_tx) == 1)
+ _xabort(0xfd);
+ end_in_tx();
+ _xend();
+ } else
+ arch_do_read_unlock(rw);
+}
+
+void rtm_read_unlock(arch_rwlock_t *rw)
+{
+ rtm_read_unlock_check(rw, true);
+}
+EXPORT_SYMBOL(rtm_read_unlock);
+
+void rtm_read_unlock_irq(arch_rwlock_t *rw)
+{
+ rtm_read_unlock_check(rw, false);
+ local_irq_enable();
+}
+EXPORT_SYMBOL(rtm_read_unlock_irq);
+
+void rtm_read_unlock_irqrestore(arch_rwlock_t *rw, unsigned long flags)
+{
+ rtm_read_unlock_check(rw, !(flags & X86_EFLAGS_IF));
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(rtm_read_unlock_irqrestore);
+
+int rtm_read_trylock(arch_rwlock_t *rw)
+{
+ if (elide_lock(rwlock_elision, !arch_rwlock_is_locked(rw)))
+ return 1;
+ return arch_do_read_trylock(rw);
+}
+EXPORT_SYMBOL(rtm_read_trylock);
+
+void rtm_write_lock(arch_rwlock_t *rw)
+{
+ if (!elide_lock(rwlock_elision, !arch_write_can_lock(rw)))
+ arch_do_write_lock(rw);
+}
+EXPORT_SYMBOL(rtm_write_lock);
+
+static inline void rtm_write_unlock_check(arch_rwlock_t *rw, bool not_enabling)
+{
+ /*
+ * Note when you get a #GP here this usually means that you
+ * unlocked a lock that was not locked. Please fix your code.
+ */
+ if (!arch_rwlock_is_locked(rw)) {
+ if (not_enabling && this_cpu_read(cli_elided) &&
+ this_cpu_read(in_tx) == 1)
+ _xabort(0xfd);
+ end_in_tx();
+ _xend();
+ } else
+ arch_do_write_unlock(rw);
+}
+
+void rtm_write_unlock(arch_rwlock_t *rw)
+{
+ rtm_write_unlock_check(rw, true);
+}
+EXPORT_SYMBOL(rtm_write_unlock);
+
+void rtm_write_unlock_irq(arch_rwlock_t *rw)
+{
+ rtm_write_unlock_check(rw, false);
+ local_irq_enable();
+}
+EXPORT_SYMBOL(rtm_write_unlock_irq);
+
+void rtm_write_unlock_irqrestore(arch_rwlock_t *rw, unsigned long flags)
+{
+ rtm_write_unlock_check(rw, !(flags & X86_EFLAGS_IF));
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(rtm_write_unlock_irqrestore);
+
+/*
+ * This should be in the headers for inlining, but include loop hell
+ * prevents it.
+ */
+
+inline int __elide_lock(void)
+{
+ if (!txn_disabled() && _xbegin() == _XBEGIN_STARTED) {
+ start_in_tx();
+ return 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(__elide_lock);
+
+inline void __elide_unlock(void)
+{
+ /*
+ * Note when you get a #GP here this usually means that you
+ * unlocked a lock that was not locked. Please fix your code.
+ */
+ end_in_tx();
+ _xend();
+}
+EXPORT_SYMBOL(__elide_unlock);
+
+
+/*
+ * CLI aborts, so avoid it inside transactions
+ *
+ * Could also turn non txn cli into transactions?
+ */
+
+static void rtm_restore_fl(unsigned long flags)
+{
+ if (flags & X86_EFLAGS_IF)
+ this_cpu_write(cli_elided, false);
+ if (!_xtest())
+ native_restore_fl(flags);
+}
+PV_CALLEE_SAVE_REGS_THUNK(rtm_restore_fl);
+
+static void rtm_irq_disable(void)
+{
+ if (!_xtest())
+ native_irq_disable();
+ else if (native_save_fl() & X86_EFLAGS_IF)
+ this_cpu_write(cli_elided, true);
+}
+PV_CALLEE_SAVE_REGS_THUNK(rtm_irq_disable);
+
+static void rtm_irq_enable(void)
+{
+ if (!_xtest())
+ native_irq_enable();
+ this_cpu_write(cli_elided, false);
+}
+PV_CALLEE_SAVE_REGS_THUNK(rtm_irq_enable);
+
+static unsigned rtm_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len)
+{
+ switch (type) {
+ case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
+ case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
+ case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
+ return paravirt_patch_default(type, clobbers, ibuf, addr, len);
+ default:
+ return native_patch(type, clobbers, ibuf, addr, len);
+ }
+}
+
+void init_rtm_spinlocks(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_RTM))
+ return;
+
+ if (strcmp(pv_info.name, "bare hardware")) {
+ pr_info("No TSX lock elision because of conflicting paravirt ops\n");
+ return;
+ }
+
+ pr_info("Enabling TSX based elided spinlocks\n");
+ pv_info.name = "rtm locking";
+ /* spin_is_contended will lie now */
+ pv_lock_ops.spin_lock = rtm_spin_lock;
+ pv_lock_ops.spin_lock_flags = rtm_spin_lock_flags;
+ pv_lock_ops.spin_trylock = rtm_spin_trylock;
+ pv_lock_ops.spin_unlock = rtm_spin_unlock;
+ pv_lock_ops.spin_unlock_flags = rtm_spin_unlock_flags;
+ pv_lock_ops.spin_unlock_irq = rtm_spin_unlock_irq;
+ pv_lock_ops.spin_is_locked = rtm_spin_is_locked;
+
+ pv_irq_ops.irq_disable = PV_CALLEE_SAVE(rtm_irq_disable);
+ pv_irq_ops.irq_enable = PV_CALLEE_SAVE(rtm_irq_enable);
+ pv_irq_ops.restore_fl = PV_CALLEE_SAVE(rtm_restore_fl);
+ pv_init_ops.patch = rtm_patch;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 90d8cc9..a888c48 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1101,6 +1101,7 @@ void __init setup_arch(char **cmdline_p)
reserve_crashkernel();
vsmp_init();
+ init_rtm_spinlocks();
io_delay_init();
--
1.7.7.6
From: Andi Kleen <[email protected]>
For tuning the adaptive locking algorithms it's useful to trace adaptive
elision skipping. Add a trace point for this case.
Used in follow-on patches.
Signed-off-by: Andi Kleen <[email protected]>
---
include/trace/events/elision.h | 31 +++++++++++++++++++++++++++++++
1 files changed, 31 insertions(+), 0 deletions(-)
create mode 100644 include/trace/events/elision.h
diff --git a/include/trace/events/elision.h b/include/trace/events/elision.h
new file mode 100644
index 0000000..5d16d02
--- /dev/null
+++ b/include/trace/events/elision.h
@@ -0,0 +1,31 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM elision
+
+#if !defined(_TRACE_ELISION_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ELISION_H
+
+#include <linux/lockdep.h>
+#include <linux/tracepoint.h>
+
+#ifdef CONFIG_RTM_LOCKS
+
+TRACE_EVENT(elision_skip_start,
+ TP_PROTO(void *lock, u32 status),
+ TP_ARGS(lock, status),
+ TP_STRUCT__entry(
+ __field(void *, lock)
+ __field(u32, status)
+ ),
+ TP_fast_assign(
+ __entry->lock = lock;
+ __entry->status = status;
+ ),
+ TP_printk("%p %x", __entry->lock, __entry->status)
+);
+
+#endif
+
+#endif
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add elision adaptation state to the rwlocks and use the generic
adaptation wrapper. This unfortunately increases the size of the rwlock:
by 8 bytes (2 for the count plus padding) for NR_CPUS > 2048, otherwise by 4 bytes.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/rwlock.h | 30 +++++++++++++++++++++---------
arch/x86/kernel/rtm-locks.c | 22 +++++++++++++++++-----
2 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/rwlock.h b/arch/x86/include/asm/rwlock.h
index a5370a0..a3929cc 100644
--- a/arch/x86/include/asm/rwlock.h
+++ b/arch/x86/include/asm/rwlock.h
@@ -6,9 +6,15 @@
#if CONFIG_NR_CPUS <= 2048
#ifndef __ASSEMBLY__
-typedef union {
- s32 lock;
- s32 write;
+typedef struct {
+ union {
+ s32 lock;
+ s32 write;
+ };
+#ifdef CONFIG_RTM_LOCKS
+ short elision_adapt;
+ /* 2 bytes padding */
+#endif
} arch_rwlock_t;
#endif
@@ -24,12 +30,18 @@ typedef union {
#include <linux/const.h>
#ifndef __ASSEMBLY__
-typedef union {
- s64 lock;
- struct {
- u32 read;
- s32 write;
+typedef struct {
+ union {
+ s64 lock;
+ struct {
+ u32 read;
+ s32 write;
+ };
};
+#ifdef CONFIG_RTM_LOCKS
+ short elision_adapt;
+ /* 6 bytes padding for now */
+#endif
} arch_rwlock_t;
#endif
@@ -42,7 +54,7 @@ typedef union {
#endif /* CONFIG_NR_CPUS */
-#define __ARCH_RW_LOCK_UNLOCKED { RW_LOCK_BIAS }
+#define __ARCH_RW_LOCK_UNLOCKED { { RW_LOCK_BIAS } }
/* Actual code is in asm/spinlock.h or in arch/x86/lib/rwlock.S */
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index 1651049..bc3275a 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -155,8 +155,16 @@ static int rtm_spin_is_locked(struct arch_spinlock *lock)
* This uses direct calls with static patching, not pvops.
*/
-__read_mostly bool rwlock_elision = true;
-module_param(rwlock_elision, bool, 0644);
+static struct static_key rwlock_elision = STATIC_KEY_INIT_FALSE;
+module_param(rwlock_elision, static_key, 0644);
+
+static __read_mostly struct elision_config readlock_elision_config =
+ DEFAULT_ELISION_CONFIG;
+TUNE_ELISION_CONFIG(readlock, readlock_elision_config);
+
+static __read_mostly struct elision_config writelock_elision_config =
+ DEFAULT_ELISION_CONFIG;
+TUNE_ELISION_CONFIG(writelock, writelock_elision_config);
void rtm_read_lock(arch_rwlock_t *rw)
{
@@ -167,7 +175,8 @@ void rtm_read_lock(arch_rwlock_t *rw)
* would abort anyways.
*/
- if (!elide_lock(rwlock_elision, !arch_rwlock_is_locked(rw)))
+ if (!elide_lock_adapt(rwlock_elision, !arch_rwlock_is_locked(rw),
+ &rw->elision_adapt, &readlock_elision_config))
arch_do_read_lock(rw);
}
EXPORT_SYMBOL(rtm_read_lock);
@@ -210,7 +219,8 @@ EXPORT_SYMBOL(rtm_read_unlock_irqrestore);
int rtm_read_trylock(arch_rwlock_t *rw)
{
- if (elide_lock(rwlock_elision, !arch_rwlock_is_locked(rw)))
+ if (elide_lock_adapt(rwlock_elision, !arch_rwlock_is_locked(rw),
+ &rw->elision_adapt, &readlock_elision_config))
return 1;
return arch_do_read_trylock(rw);
}
@@ -218,7 +228,8 @@ EXPORT_SYMBOL(rtm_read_trylock);
void rtm_write_lock(arch_rwlock_t *rw)
{
- if (!elide_lock(rwlock_elision, !arch_write_can_lock(rw)))
+ if (!elide_lock_adapt(rwlock_elision, !arch_write_can_lock(rw),
+ &rw->elision_adapt, &writelock_elision_config))
arch_do_write_lock(rw);
}
EXPORT_SYMBOL(rtm_write_lock);
@@ -451,6 +462,7 @@ void init_rtm_spinlocks(void)
pv_irq_ops.restore_fl = PV_CALLEE_SAVE(rtm_restore_fl);
pv_init_ops.patch = rtm_patch;
+ static_key_slow_inc(&rwlock_elision);
static_key_slow_inc(&mutex_elision);
}
--
1.7.7.6
From: Andi Kleen <[email protected]>
Very straightforward. Use the non-adaptive elision wrappers for
bit spinlocks. This is useful because bit spinlocks perform very poorly
under contention.
The functions are a bit on the big side for inlining now, but
I kept them inline for now.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/kernel/rtm-locks.c | 5 +++--
include/linux/bit_spinlock.h | 16 ++++++++++++++++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index bc3275a..40a0e7d 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -464,14 +464,15 @@ void init_rtm_spinlocks(void)
static_key_slow_inc(&rwlock_elision);
static_key_slow_inc(&mutex_elision);
+ bitlock_elision = true;
}
__read_mostly struct elision_config mutex_elision_config =
DEFAULT_ELISION_CONFIG;
TUNE_ELISION_CONFIG(mutex, mutex_elision_config);
-__read_mostly bool rwsem_elision = true;
-module_param(rwsem_elision, bool, 0644);
+__read_mostly bool bitlock_elision;
+module_param(bitlock_elision, bool, 0644);
module_param_cb(lock_el_skip, &param_ops_percpu_uint, &lock_el_skip,
0644);
diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index 3b5bafc..2954b86 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -5,6 +5,9 @@
#include <linux/preempt.h>
#include <linux/atomic.h>
#include <linux/bug.h>
+#include <linux/elide.h>
+
+extern bool bitlock_elision;
/*
* bit-based spin_lock()
@@ -14,6 +17,9 @@
*/
static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
+ if (elide_lock(bitlock_elision, test_bit(bitnum, addr) == 0))
+ return;
+
/*
* Assuming the lock is uncontended, this never enters
* the body of the outer loop. If it is contended, then
@@ -39,6 +45,9 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
*/
static inline int bit_spin_trylock(int bitnum, unsigned long *addr)
{
+ if (elide_lock(bitlock_elision, test_bit(bitnum, addr) == 0))
+ return 1;
+
preempt_disable();
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
if (unlikely(test_and_set_bit_lock(bitnum, addr))) {
@@ -55,6 +64,9 @@ static inline int bit_spin_trylock(int bitnum, unsigned long *addr)
*/
static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
{
+ if (elide_unlock(test_bit(bitnum, addr) == 0))
+ return;
+
#ifdef CONFIG_DEBUG_SPINLOCK
BUG_ON(!test_bit(bitnum, addr));
#endif
@@ -72,6 +84,9 @@ static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
*/
static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
{
+ if (elide_unlock(test_bit(bitnum, addr) == 0))
+ return;
+
#ifdef CONFIG_DEBUG_SPINLOCK
BUG_ON(!test_bit(bitnum, addr));
#endif
@@ -87,6 +102,7 @@ static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
*/
static inline int bit_spin_is_locked(int bitnum, unsigned long *addr)
{
+ elide_abort();
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
return test_bit(bitnum, addr);
#elif defined CONFIG_PREEMPT_COUNT
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add generic noop macros (act like transaction aborted) for RTM.
The main use case is an occasional _xtest() added to generic
code, without needing ifdefs. On x86+RTM this will use
real TSX instructions.
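For illustration, a minimal sketch (hypothetical caller, not part of
this patch) of what such an occasional _xtest() in generic code could
look like:

#include <linux/rtm.h>
#include <linux/atomic.h>

static atomic_t event_count = ATOMIC_INIT(0);

/*
 * Skip the shared-counter update while speculating to avoid a needless
 * conflict abort; slightly undercounting is fine for a statistic.
 * Without CONFIG_RTM_LOCKS _xtest() is the constant 0, so the branch
 * disappears at compile time.
 */
static void count_event(void)
{
        if (!_xtest())
                atomic_inc(&event_count);
}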
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rtm.h | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)
create mode 100644 include/linux/rtm.h
diff --git a/include/linux/rtm.h b/include/linux/rtm.h
new file mode 100644
index 0000000..887b221
--- /dev/null
+++ b/include/linux/rtm.h
@@ -0,0 +1,15 @@
+#ifndef _LINUX_RTM
+#define _LINUX_RTM 1
+
+#ifdef CONFIG_RTM_LOCKS
+#include <asm/rtm.h>
+#else
+/* Make transactions appear as always abort */
+#define _XBEGIN_STARTED 0
+#define _xbegin() 1
+#define _xtest() 0
+#define _xend() do {} while (0)
+#define _xabort(x) do {} while (0)
+#endif
+
+#endif
--
1.7.7.6
From: Andi Kleen <[email protected]>
For some locks that have a low chance of not aborting it's
best to just disable transactions.
Add a counter to thread_info to allow disabling transactions
for the current task.
I originally experimented with more complicated solutions,
like a magic spinlock type to disable with magic type matching,
but that was very intrusive in the locking code.
This setup -- while adding a few more instructions -- is more
flexible and seems to work quite nicely.
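For illustration, a minimal sketch (hypothetical driver code, not part
of this patch) of how a region that is known to always abort could opt
out of elision; nesting is safe because notxn is a counter:

#include <linux/spinlock.h>
#include <linux/io.h>
#include <linux/thread_info.h>

static DEFINE_SPINLOCK(hw_lock);

/*
 * The MMIO write would abort any transaction anyway, so skip elision
 * up front and take the lock for real.
 */
static void hw_update(void __iomem *reg, u32 val)
{
        disable_txn();
        spin_lock(&hw_lock);
        writel(val, reg);
        spin_unlock(&hw_lock);
        reenable_txn();
}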
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/thread_info.h | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 2cd056e..91bb84e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -39,6 +39,7 @@ struct thread_info {
*/
__u8 supervisor_stack[0];
#endif
+ unsigned notxn; /* transactions are disabled */
unsigned int sig_on_uaccess_error:1;
unsigned int uaccess_err:1; /* uaccess failed */
};
@@ -279,6 +280,13 @@ static inline bool is_ia32_task(void)
#endif
return false;
}
+
+#ifdef CONFIG_RTM_LOCKS
+#define txn_disabled() (current_thread_info()->notxn)
+#define disable_txn() (current_thread_info()->notxn++)
+#define reenable_txn() (current_thread_info()->notxn--)
+#endif
+
#endif /* !__ASSEMBLY__ */
#ifndef __ASSEMBLY__
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add the elision adaptation state to struct mutex and use elide_lock_adapt().
This allows mutexes that do not elide to automatically disable
elision on themselves for some time. This means we have a fail-safe
for mutexes that do not elide well and do not need to annotate
every mutex. For example, a mutex whose critical section always does
MMIO will abort internally, charge its skip count, and take the normal
fastpath for the next few acquisitions before speculation is retried.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mutex.h | 20 +++++++++++++++-----
arch/x86/kernel/rtm-locks.c | 10 ++++++++--
include/linux/mutex.h | 3 +++
3 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7f585e3..bb9cf9d 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -10,8 +10,10 @@
#define ARCH_HAS_MUTEX_AND_OWN 1
#include <linux/elide.h>
+#include <linux/jump_label.h>
-extern bool mutex_elision;
+extern struct static_key mutex_elision;
+extern struct elision_config mutex_elision_config;
/*
* Try speculation first and only do the normal locking and owner setting
@@ -21,8 +23,10 @@ extern bool mutex_elision;
#define mutex_free(l) (atomic_read(&(l)->count) == 1)
#define __mutex_fastpath_lock_and_own(l, s) ({ \
- if (!elide_lock(mutex_elision, \
- mutex_free(l))) { \
+ if (!elide_lock_adapt(mutex_elision, \
+ mutex_free(l), \
+ &l->elision_adapt, \
+ &mutex_elision_config)) { \
__mutex_fastpath_lock(&(l)->count, s); \
mutex_set_owner(l); \
} \
@@ -37,7 +41,10 @@ extern bool mutex_elision;
#define __mutex_fastpath_lock_retval_and_own(l, s) ({ \
int ret = 0; \
- if (!elide_lock(mutex_elision, mutex_free(l))) { \
+ if (!elide_lock_adapt(mutex_elision, \
+ mutex_free(l), \
+ &l->elision_adapt, \
+ &mutex_elision_config)) { \
ret = __mutex_fastpath_lock_retval(&(l)->count, s); \
if (!ret) \
mutex_set_owner(l); \
@@ -46,7 +53,10 @@ extern bool mutex_elision;
#define __mutex_fastpath_trylock_and_own(l, s) ({ \
int ret = 1; \
- if (!elide_lock(mutex_elision, mutex_free(l))) {\
+ if (!elide_lock_adapt(mutex_elision, \
+ mutex_free(l), \
+ &l->elision_adapt, \
+ &mutex_elision_config)) { \
ret = __mutex_fastpath_trylock(&(l)->count, s); \
if (ret) \
mutex_set_owner(l); \
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index 8d4763d..a313a81 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -416,6 +416,9 @@ static unsigned rtm_patch(u8 type, u16 clobbers, void *ibuf,
}
}
+struct static_key mutex_elision = STATIC_KEY_INIT_FALSE;
+module_param(mutex_elision, static_key, 0644);
+
void init_rtm_spinlocks(void)
{
if (!boot_cpu_has(X86_FEATURE_RTM))
@@ -441,10 +444,13 @@ void init_rtm_spinlocks(void)
pv_irq_ops.irq_enable = PV_CALLEE_SAVE(rtm_irq_enable);
pv_irq_ops.restore_fl = PV_CALLEE_SAVE(rtm_restore_fl);
pv_init_ops.patch = rtm_patch;
+
+ static_key_slow_inc(&mutex_elision);
}
-__read_mostly bool mutex_elision = true;
-module_param(mutex_elision, bool, 0644);
+__read_mostly struct elision_config mutex_elision_config =
+ DEFAULT_ELISION_CONFIG;
+TUNE_ELISION_CONFIG(mutex, mutex_elision_config);
__read_mostly bool rwsem_elision = true;
module_param(rwsem_elision, bool, 0644);
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 0574095..7ae9a08 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -54,6 +54,9 @@ struct mutex {
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct task_struct *owner;
#endif
+#ifdef CONFIG_RTM_LOCKS
+ short elision_adapt;
+#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
void *magic;
--
1.7.7.6
From: Andi Kleen <[email protected]>
Add a module param type for static keys. Useful for
time critical flags.
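For illustration, a minimal sketch (hypothetical user, not part of this
patch) of exposing a static key as a writable module parameter:

#include <linux/jump_label.h>
#include <linux/moduleparam.h>

static struct static_key fast_flag = STATIC_KEY_INIT_FALSE;
module_param(fast_flag, static_key, 0644);

/*
 * Default-off flag with no branch cost on the hot path until someone
 * writes 1 to /sys/module/<module>/parameters/fast_flag.
 */
static void hot_path(void)
{
        if (static_key_false(&fast_flag)) {
                /* rare, patched-in extra work goes here */
        }
}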
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/moduleparam.h | 6 ++++++
kernel/params.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index e3c41af..ac815d2 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -398,6 +398,12 @@ extern int param_set_bint(const char *val, const struct kernel_param *kp);
extern struct kernel_param_ops param_ops_percpu_uint;
#define param_check_percpu_uint(name, p) param_check_uint
+/* Static key module params. Useful for time critical flags. */
+struct static_key;
+extern struct kernel_param_ops param_ops_static_key;
+#define param_check_static_key(name, p) \
+ __param_check(name, p, struct static_key)
+
/**
* module_param_array - a parameter which is an array of some type
* @name: the name of the array variable
diff --git a/kernel/params.c b/kernel/params.c
index 1a8a1ca..68f9482 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -23,6 +23,7 @@
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/ctype.h>
+#include <linux/jump_label.h>
/* Protects all parameters, and incidentally kmalloced_param list. */
static DEFINE_MUTEX(param_lock);
@@ -527,6 +528,40 @@ struct kernel_param_ops param_ops_percpu_uint = {
};
EXPORT_SYMBOL(param_ops_percpu_uint);
+/* Static key module params. Like a bool, but allows patching the jumps. */
+
+/* No one else should be changing the key concurrently */
+static int param_static_key_set(const char *val, const struct kernel_param *kp)
+{
+ static DEFINE_SPINLOCK(lock);
+ bool l;
+ struct static_key *key = (struct static_key *)(kp->arg);
+ int ret = strtobool(val, &l);
+
+ if (ret < 0)
+ return ret;
+ /* Work around racy static key interface */
+ spin_lock(&lock);
+ if (l && !atomic_read(&key->enabled))
+ static_key_slow_inc(key);
+ if (!l && atomic_read(&key->enabled))
+ static_key_slow_dec(key);
+ spin_unlock(&lock);
+ return ret;
+}
+
+static int param_static_key_get(char *buffer, const struct kernel_param *kp)
+{
+ struct static_key *key = kp->arg;
+ return sprintf(buffer, "%u", atomic_read(&key->enabled) > 0);
+}
+
+struct kernel_param_ops param_ops_static_key = {
+ .set = param_static_key_set,
+ .get = param_static_key_get,
+};
+EXPORT_SYMBOL(param_ops_static_key);
+
/* sysfs output in /sys/modules/XYZ/parameters/ */
#define to_module_attr(n) container_of(n, struct module_attribute, attr)
#define to_module_kobject(n) container_of(n, struct module_kobject, kobj)
--
1.7.7.6
From: Andi Kleen <[email protected]>
rwspinlocks don't support paravirt ops, so add hooks to call lock elision
based on the CPUID bit. We use the standard patching, so the overhead
in the fast path is low when RTM is not supported.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/spinlock.h | 104 +++++++++++++++++++++++++++++++++++++--
1 files changed, 99 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..5bff3c0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -6,6 +6,10 @@
#include <asm/processor.h>
#include <linux/compiler.h>
#include <asm/paravirt.h>
+#include <asm/cpufeature.h>
+#include <asm/rtm-locks.h>
+#include <linux/elide.h>
+
/*
* Your basic SMP spinlocks, allowing only a single CPU anywhere
*
@@ -169,7 +173,12 @@ static inline int arch_write_can_lock(arch_rwlock_t *lock)
return lock->write == WRITE_LOCK_CMP;
}
-static inline void arch_read_lock(arch_rwlock_t *rw)
+static inline int arch_rwlock_is_locked(arch_rwlock_t *lock)
+{
+ return lock->lock != RW_LOCK_BIAS;
+}
+
+static inline void arch_do_read_lock(arch_rwlock_t *rw)
{
asm volatile(LOCK_PREFIX READ_LOCK_SIZE(dec) " (%0)\n\t"
"jns 1f\n"
@@ -178,7 +187,20 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
::LOCK_PTR_REG (rw) : "memory");
}
-static inline void arch_write_lock(arch_rwlock_t *rw)
+#define ARCH_HAS_RWLOCK_UNLOCK_IRQ 1
+
+#ifdef CONFIG_RTM_LOCKS
+#define RTM_OR_NORMAL(n, r) (static_cpu_has(X86_FEATURE_RTM) ? (r) : (n))
+#else
+#define RTM_OR_NORMAL(n, r) n
+#endif
+
+static inline void arch_read_lock(arch_rwlock_t *rw)
+{
+ RTM_OR_NORMAL(arch_do_read_lock(rw), rtm_read_lock(rw));
+}
+
+static inline void arch_do_write_lock(arch_rwlock_t *rw)
{
asm volatile(LOCK_PREFIX WRITE_LOCK_SUB(%1) "(%0)\n\t"
"jz 1f\n"
@@ -188,7 +210,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
: "memory");
}
-static inline int arch_read_trylock(arch_rwlock_t *lock)
+static inline void arch_write_lock(arch_rwlock_t *rw)
+{
+ RTM_OR_NORMAL(arch_do_write_lock(rw), rtm_write_lock(rw));
+}
+
+static inline int arch_do_read_trylock(arch_rwlock_t *lock)
{
READ_LOCK_ATOMIC(t) *count = (READ_LOCK_ATOMIC(t) *)lock;
@@ -198,6 +225,13 @@ static inline int arch_read_trylock(arch_rwlock_t *lock)
return 0;
}
+static inline int arch_read_trylock(arch_rwlock_t *lock)
+{
+ return RTM_OR_NORMAL(arch_do_read_trylock(lock),
+ rtm_read_trylock(lock));
+}
+
+/* Not elided because no one uses it */
static inline int arch_write_trylock(arch_rwlock_t *lock)
{
atomic_t *count = (atomic_t *)&lock->write;
@@ -208,18 +242,78 @@ static inline int arch_write_trylock(arch_rwlock_t *lock)
return 0;
}
-static inline void arch_read_unlock(arch_rwlock_t *rw)
+static inline void arch_do_read_unlock(arch_rwlock_t *rw)
{
asm volatile(LOCK_PREFIX READ_LOCK_SIZE(inc) " %0"
:"+m" (rw->lock) : : "memory");
}
-static inline void arch_write_unlock(arch_rwlock_t *rw)
+static inline void arch_do_write_unlock(arch_rwlock_t *rw)
{
asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
: "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
}
+static inline void arch_read_unlock(arch_rwlock_t *rw)
+{
+ RTM_OR_NORMAL(arch_do_read_unlock(rw), rtm_read_unlock(rw));
+}
+
+static inline void arch_write_unlock(arch_rwlock_t *rw)
+{
+ RTM_OR_NORMAL(arch_do_write_unlock(rw), rtm_write_unlock(rw));
+}
+
+static inline void arch_do_read_unlock_irqrestore(arch_rwlock_t *rw,
+ unsigned long flags)
+{
+ arch_do_read_unlock(rw);
+ local_irq_restore(flags);
+}
+
+static inline void arch_do_write_unlock_irqrestore(arch_rwlock_t *rw,
+ unsigned long flags)
+{
+ arch_do_write_unlock(rw);
+ local_irq_restore(flags);
+}
+
+static inline void arch_read_unlock_irqrestore(arch_rwlock_t *rw,
+ unsigned long flags)
+{
+ RTM_OR_NORMAL(arch_do_read_unlock_irqrestore(rw, flags),
+ rtm_read_unlock_irqrestore(rw, flags));
+}
+
+static inline void arch_write_unlock_irqrestore(arch_rwlock_t *rw,
+ unsigned long flags)
+{
+ RTM_OR_NORMAL(arch_do_write_unlock_irqrestore(rw, flags),
+ rtm_write_unlock_irqrestore(rw, flags));
+}
+
+static inline void arch_do_read_unlock_irq(arch_rwlock_t *rw)
+{
+ arch_do_read_unlock(rw);
+ local_irq_enable();
+}
+
+static inline void arch_do_write_unlock_irq(arch_rwlock_t *rw)
+{
+ arch_do_write_unlock(rw);
+ local_irq_enable();
+}
+
+static inline void arch_read_unlock_irq(arch_rwlock_t *rw)
+{
+ RTM_OR_NORMAL(arch_do_read_unlock_irq(rw), rtm_read_unlock_irq(rw));
+}
+
+static inline void arch_write_unlock_irq(arch_rwlock_t *rw)
+{
+ RTM_OR_NORMAL(arch_do_write_unlock_irq(rw), rtm_write_unlock_irq(rw));
+}
+
#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
--
1.7.7.6
From: Andi Kleen <[email protected]>
Extend the elide() macro to support adaptation state per lock.
The adaptation keeps track of whether the elision is successful.
When the lock aborts due to internal reasons (e.g. it always
writes an MSR or always does MMIO) disable elision for some time.
The state is kept as a "short" count in the lock. This can
then be checked in the elision wrapper.
This just adds the infrastructure; it will actually be used
in follow-on patches.
We use a per cpu variant of module params as statistic counters for now.
This could be moved elsewhere in sysfs too, but it would be far
more code.
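For illustration, a minimal sketch (hypothetical lock type, not part of
this patch) of how a lock with a spare 16 bit field plugs into the
adaptive wrapper; the fallback here is a plain test-and-set, purely to
keep the example self-contained:

#include <linux/elide.h>
#include <linux/jump_label.h>
#include <linux/moduleparam.h>
#include <linux/atomic.h>
#include <linux/cache.h>
#include <asm/processor.h>

struct xyz_lock {
        atomic_t taken;                 /* 0 = free, 1 = held */
        short elision_adapt;            /* managed by __elide_lock_adapt() */
};

static struct static_key xyz_elision = STATIC_KEY_INIT_FALSE;
static __read_mostly struct elision_config xyz_elision_config =
        DEFAULT_ELISION_CONFIG;
TUNE_ELISION_CONFIG(xyz, xyz_elision_config);

/* Enable with static_key_slow_inc(&xyz_elision) once RTM is known to exist. */
static void xyz_lock_acquire(struct xyz_lock *l)
{
        if (elide_lock_adapt(xyz_elision, !atomic_read(&l->taken),
                             &l->elision_adapt, &xyz_elision_config))
                return;                 /* speculating, lock left free */
        while (atomic_cmpxchg(&l->taken, 0, 1) != 0)
                cpu_relax();
}

static void xyz_lock_release(struct xyz_lock *l)
{
        if (elide_unlock(!atomic_read(&l->taken)))
                return;                 /* transaction committed */
        atomic_xchg(&l->taken, 0);      /* full barrier, releases the lock */
}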
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/elide.h | 83 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/rtm-locks.c | 99 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 182 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/elide.h b/arch/x86/include/asm/elide.h
index c492aed..d102bb6 100644
--- a/arch/x86/include/asm/elide.h
+++ b/arch/x86/include/asm/elide.h
@@ -4,12 +4,52 @@
#ifdef CONFIG_RTM_LOCKS
#include <asm/rtm.h>
+struct elision_config {
+ short internal_abort_skip;
+ short lock_busy_skip;
+ short other_abort_skip;
+ short conflict_abort_skip;
+ int conflict_retry;
+ int retry_timeout;
+ int lock_busy_retry;
+};
+
+/* Tuning preliminary */
+#define DEFAULT_ELISION_CONFIG { \
+ .internal_abort_skip = 5, \
+ .lock_busy_skip = 3, \
+ .other_abort_skip = 3, \
+ .conflict_abort_skip = 3, \
+ .conflict_retry = 3, \
+ .retry_timeout = 500, \
+ .lock_busy_retry = 3, \
+}
+
+#define TUNE_ELISION_CONFIG(prefix, name) \
+ module_param_named(prefix ## _internal_abort_skip, \
+ name.internal_abort_skip, short, 0644); \
+ module_param_named(prefix ## _lock_busy_skip, \
+ name.lock_busy_skip, short, 0644); \
+ module_param_named(prefix ## _other_abort_skip, \
+ name.other_abort_skip, short, 0644); \
+ module_param_named(prefix ## _conflict_abort_skip, \
+ name.conflict_abort_skip, short, 0644); \
+ module_param_named(prefix ## _conflict_retry, \
+ name.conflict_retry, int, 0644); \
+ module_param_named(prefix ## _retry_timeout, \
+ name.retry_timeout, int, 0644); \
+ module_param_named(prefix ## _lock_busy_retry, \
+ name.lock_busy_retry, int, 0644)
+
+
/*
* These are out of line unfortunately, just to avoid
* a nasty include loop with per cpu data.
* (FIXME)
*/
extern int __elide_lock(void);
+extern int __elide_lock_adapt(short *adapt, struct elision_config *config,
+ int *retry);
extern void __elide_unlock(void);
/*
@@ -33,6 +73,49 @@ extern void __elide_unlock(void);
flag; \
})
+enum { ELIDE_TXN, ELIDE_STOP, ELIDE_RETRY };
+
+/*
+ * Adaptive elision lock wrapper
+ *
+ * Like above. a is a pointer to the
+ * short adaptation count stored in the lock.
+ * config is a pointer to an elision_config for the lock type.
+ *
+ * This is a bit convoluted because we need a retry loop with a lock test
+ * to wait for the lock to become free again. Right now we just spin up to
+ * a defined number of iterations.
+ *
+ * Ideally every lock that can afford to have a 16 bit count stored
+ * in it should use this variant.
+ */
+#define elide_lock_adapt(f, l, a, config) ({ \
+ int flag = 0; \
+ if (static_key_true(&(f))) { \
+ int retry = (config)->conflict_retry; \
+ int timeout = (config)->retry_timeout; \
+ int status; \
+ again: \
+ status = __elide_lock_adapt(a, config, &retry); \
+ /* Retries wait until the lock is free. */ \
+ if (unlikely(status == ELIDE_RETRY)) { \
+ while (!(l) && --timeout > 0) \
+ cpu_relax(); \
+ if (timeout > 0) \
+ goto again; \
+ } \
+ if (likely(status != ELIDE_STOP)) { \
+ /* in transaction. check now if the lock is free. */ \
+ if (likely(l)) \
+ flag = 1; \
+ else \
+ _xabort(0xff); \
+ } \
+ } \
+ flag; \
+})
+
+
/*
* Note that if you see a general protection fault
* in the _xend you have a unmatched unlock. Please fix
diff --git a/arch/x86/kernel/rtm-locks.c b/arch/x86/kernel/rtm-locks.c
index 1811028..8d4763d 100644
--- a/arch/x86/kernel/rtm-locks.c
+++ b/arch/x86/kernel/rtm-locks.c
@@ -36,6 +36,9 @@
#include <asm/rtm.h>
#include <asm/paravirt.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/elision.h>
+
/*
* We need a software in_tx marker, to answer the question
* "Is this an inner nested transaction commit?" inside the transaction.
@@ -265,6 +268,97 @@ inline int __elide_lock(void)
}
EXPORT_SYMBOL(__elide_lock);
+/* XXX make per lock type */
+static DEFINE_PER_CPU(unsigned, lock_el_skip);
+static DEFINE_PER_CPU(unsigned, lock_el_start_skip);
+
+/*
+ * Implement a simple adaptive algorithm. When the lock aborts
+ * for an internal (not external) cause we stop eliding for some time.
+ * For a conflict we retry a defined number of times, then also skip.
+ * For other aborts we also skip. All the skip counts are individually
+ * configurable.
+ *
+ * The retry loop is in the caller, as it needs to wait for the lock
+ * to free itself first. Otherwise we would cycle too fast through
+ * the retries.
+ *
+ * The complex logic is generally in the slow path, when we would
+ * have blocked anyways.
+ */
+
+static inline void skip_update(short *count, short skip, unsigned status)
+{
+ __this_cpu_inc(lock_el_start_skip);
+ /* Can lose updates, but that is ok as this is just a hint. */
+ if (*count != skip) {
+ trace_elision_skip_start(count, status);
+ *count = skip;
+ }
+}
+
+static int
+__elide_lock_adapt_slow(short *count, struct elision_config *config,
+ int *retry, unsigned status)
+{
+ /* Internal abort? Adapt the mutex */
+ if (!(status & _XABORT_RETRY)) {
+ /* Lock busy is a special case */
+ if ((status & _XABORT_EXPLICIT) &&
+ _XABORT_CODE(status) == 0xff) {
+ /* Do some retries on lock busy. */
+ if (*retry > 0) {
+ if (*retry == config->conflict_retry &&
+ config->lock_busy_retry <
+ config->conflict_retry)
+ *retry = config->lock_busy_retry;
+ (*retry)--;
+ return ELIDE_RETRY;
+ }
+ skip_update(count, config->lock_busy_skip, status);
+ } else
+ skip_update(count, config->internal_abort_skip, status);
+ return ELIDE_STOP;
+ }
+ if (!(status & _XABORT_CONFLICT)) {
+ /* No retries for capacity. Maybe later. */
+ skip_update(count, config->other_abort_skip, status);
+ return ELIDE_STOP;
+ }
+ /* Was a conflict. Do some retries. */
+ if (*retry > 0) {
+ /* In caller wait for lock becoming free and then retry. */
+ (*retry)--;
+ return ELIDE_RETRY;
+ }
+
+ skip_update(count, config->conflict_abort_skip, status);
+ return ELIDE_STOP;
+}
+
+inline int __elide_lock_adapt(short *count, struct elision_config *config,
+ int *retry)
+{
+ unsigned status;
+
+ if (unlikely(txn_disabled()))
+ return ELIDE_STOP;
+ /* Adapted lock? */
+ if (unlikely(*count > 0)) {
+ /* Can lose updates, but that is ok as this is just a hint. */
+ (*count)--;
+ /* TBD should count this per lock type and per lock */
+ __this_cpu_inc(lock_el_skip);
+ return ELIDE_STOP;
+ }
+ if (likely((status = _xbegin()) == _XBEGIN_STARTED)) {
+ start_in_tx();
+ return ELIDE_TXN;
+ }
+ return __elide_lock_adapt_slow(count, config, retry, status);
+}
+EXPORT_SYMBOL(__elide_lock_adapt);
+
inline void __elide_unlock(void)
{
/*
@@ -354,3 +448,8 @@ module_param(mutex_elision, bool, 0644);
__read_mostly bool rwsem_elision = true;
module_param(rwsem_elision, bool, 0644);
+
+module_param_cb(lock_el_skip, &param_ops_percpu_uint, &lock_el_skip,
+ 0644);
+module_param_cb(lock_el_start_skip, &param_ops_percpu_uint,
+ &lock_el_start_skip, 0644);
--
1.7.7.6
From: Andi Kleen <[email protected]>
__read/write_lock_failed did execute a PAUSE first thing before
checking the lock. This aborts transactions. Check the lock
state before executing the first PAUSE. This avoids a small
number of extra aborts, and is slightly cheaper too.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/lib/rwlock.S | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/lib/rwlock.S b/arch/x86/lib/rwlock.S
index 1cad221..9a6bc12 100644
--- a/arch/x86/lib/rwlock.S
+++ b/arch/x86/lib/rwlock.S
@@ -16,10 +16,12 @@ ENTRY(__write_lock_failed)
FRAME
0: LOCK_PREFIX
WRITE_LOCK_ADD($RW_LOCK_BIAS) (%__lock_ptr)
+ cmpl $WRITE_LOCK_CMP, (%__lock_ptr)
+ je 2f
1: rep; nop
cmpl $WRITE_LOCK_CMP, (%__lock_ptr)
jne 1b
- LOCK_PREFIX
+2: LOCK_PREFIX
WRITE_LOCK_SUB($RW_LOCK_BIAS) (%__lock_ptr)
jnz 0b
ENDFRAME
@@ -32,10 +34,12 @@ ENTRY(__read_lock_failed)
FRAME
0: LOCK_PREFIX
READ_LOCK_SIZE(inc) (%__lock_ptr)
+ READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
+ jns 2f
1: rep; nop
READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
js 1b
- LOCK_PREFIX
+2: LOCK_PREFIX
READ_LOCK_SIZE(dec) (%__lock_ptr)
js 0b
ENDFRAME
--
1.7.7.6
From: Andi Kleen <[email protected]>
This adds the basic RTM (Restricted Transactional Memory)
intrinsics for TSX, implemented with alternative() so that they can be
transparently used without checking CPUID first.
When the CPU does not support TSX we just always jump to the abort handler.
These intrinsics are only expected to be used by some low level code
that presents higher level interface (like locks).
This is using the same interface as gcc and icc. There's a way to implement
the intrinsics more efficiently with newer compilers that support asm goto,
but to avoid undue dependencies on new tool chains this is not used here.
Also the current way looks slightly nicer, at the cost of only two more
instructions.
Also don't require a TSX aware assembler -- all new instructions are implemented
with .byte.
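For reference, a minimal sketch of the calling pattern these intrinsics
expect (hypothetical example, not part of this patch; a real user would
serialize the fallback path, e.g. with a lock):

#include <asm/rtm.h>

static void counter_inc(int *counter)
{
        unsigned status;

        if ((status = _xbegin()) == _XBEGIN_STARTED) {
                (*counter)++;           /* speculative until _xend() */
                _xend();                /* commit */
                return;
        }
        /*
         * Fallback is mandatory: status holds the _XABORT_* bits, or
         * _XABORT_SOFTWARE when the CPU has no RTM.
         */
        (*counter)++;
}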
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/rtm.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 82 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/rtm.h
diff --git a/arch/x86/include/asm/rtm.h b/arch/x86/include/asm/rtm.h
new file mode 100644
index 0000000..7075a04
--- /dev/null
+++ b/arch/x86/include/asm/rtm.h
@@ -0,0 +1,82 @@
+#ifndef _RTM_OFFICIAL_H
+#define _RTM_OFFICIAL_H 1
+
+#include <linux/compiler.h>
+#include <asm/alternative.h>
+#include <asm/cpufeature.h>
+#include <asm/nops.h>
+
+/*
+ * RTM -- restricted transactional memory ISA
+ *
+ * Official RTM intrinsics interface matching gcc/icc, but works
+ * on older gcc compatible compilers and binutils.
+ *
+ * _xbegin() starts a transaction. When it returns a value different
+ * from _XBEGIN_STARTED a non transactional fallback path
+ * should be executed.
+ *
+ * This is a special kernel variant that supports binary patching.
+ * When the CPU does not support RTM we always jump to the abort handler.
+ * And _xtest() always returns 0.
+
+ * This means these intrinsics can be used without checking cpu_has_rtm
+ * first.
+ *
+ * This is the low level interface mapping directly to the instructions.
+ * Usually kernel code will use a higher level abstraction instead (like locks)
+ *
+ * Note this can be implemented more efficiently on compilers that support
+ * "asm goto". But we don't want to require this right now.
+ */
+
+#define _XBEGIN_STARTED (~0u)
+#define _XABORT_EXPLICIT (1 << 0)
+#define _XABORT_RETRY (1 << 1)
+#define _XABORT_CONFLICT (1 << 2)
+#define _XABORT_CAPACITY (1 << 3)
+#define _XABORT_DEBUG (1 << 4)
+#define _XABORT_NESTED (1 << 5)
+#define _XABORT_CODE(x) (((x) >> 24) & 0xff)
+
+#define _XABORT_SOFTWARE -5 /* not part of ISA */
+
+static __always_inline int _xbegin(void)
+{
+ int ret;
+ alternative_io("mov %[fallback],%[ret] ; " ASM_NOP6,
+ "mov %[started],%[ret] ; "
+ ".byte 0xc7,0xf8 ; .long 0 # XBEGIN 0",
+ X86_FEATURE_RTM,
+ [ret] "=a" (ret),
+ [fallback] "i" (_XABORT_SOFTWARE),
+ [started] "i" (_XBEGIN_STARTED) : "memory");
+ return ret;
+}
+
+static __always_inline void _xend(void)
+{
+ /* Not patched because these should be not executed in fallback */
+ asm volatile(".byte 0x0f,0x01,0xd5 # XEND" : : : "memory");
+}
+
+static __always_inline void _xabort(const unsigned int status)
+{
+ alternative_input(ASM_NOP3,
+ ".byte 0xc6,0xf8,%P0 # XABORT",
+ X86_FEATURE_RTM,
+ "i" (status) : "memory");
+}
+
+static __always_inline int _xtest(void)
+{
+ unsigned char out;
+ alternative_io("xor %0,%0 ; " ASM_NOP5,
+ ".byte 0x0f,0x01,0xd6 ; setnz %0 # XTEST",
+ X86_FEATURE_RTM,
+ "=r" (out),
+ "i" (0) : "memory");
+ return out;
+}
+
+#endif
--
1.7.7.6
From: Andi Kleen <[email protected]>
Writing _xbegin, which is like setjmp, in an if is very natural.
Stop checkpatch's whining about this.
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
scripts/checkpatch.pl | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index b28cc38..659e683 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2831,7 +2831,10 @@ sub process {
$line =~ /\b(?:if|while|for)\s*\(/ && $line !~ /^.\s*#/) {
my ($s, $c) = ($stat, $cond);
- if ($c =~ /\bif\s*\(.*[^<>!=]=[^=].*/s) {
+ # if ((status = _xbegin()) == _XBEGIN_STARTED) is natural,
+ # so don't warn about this case.
+ if ($c =~ /\bif\s*\(.*[^<>!=]=[^=].*/s &&
+ $c !~ /_xbegin/) {
ERROR("ASSIGN_IN_IF",
"do not use assignment in if condition\n" . $herecurr);
}
--
1.7.7.6
On Fri, Mar 22, 2013 at 06:25:06PM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> For some locks that have a low chance of not aborting it's
> best to just disable transactions.
>
> Add a counter to thread_info to allow disabling transactions
> for the current task.
>
> I originally experimented with more complicated solutions,
> like a magic spinlock type to disable with magic type matching,
> but that was very intrusive in the locking code.
>
> This setup -- while adding a few more instructions -- is more
> flexible and seems to work quite nicely.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> arch/x86/include/asm/thread_info.h | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 2cd056e..91bb84e 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -39,6 +39,7 @@ struct thread_info {
> */
> __u8 supervisor_stack[0];
> #endif
> + unsigned notxn; /* transactions are disabled */
This surely can be a bitfield like the other two below. It is basically
begging to be one.
> unsigned int sig_on_uaccess_error:1;
> unsigned int uaccess_err:1; /* uaccess failed */
> };
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
>
> This surely can be a bitfield like the other two below. It is basically
> begging to be one.
Bit fields are slower and larger in code and unlike the others this is on hot
paths.
-Andi
--
[email protected] -- Speaking for myself only.
On Sat, Mar 23, 2013 at 02:51:56PM +0100, Andi Kleen wrote:
> Bit fields are slower and larger in code and unlike the others this is
> on hot paths.
Really? Let's see:
unsigned:
=========
.file 8 "/w/kernel/linux-2.6/arch/x86/include/asm/thread_info.h"
.loc 8 211 0
#APP
# 211 "/w/kernel/linux-2.6/arch/x86/include/asm/thread_info.h" 1
movq %gs:kernel_stack,%rax #, pfo_ret__
# 0 "" 2
.LVL238:
#NO_APP
... # AMD F10h SNB
disable:
incl -8056(%rax) # ti_25->notxn # INC mem: 4 ; 6
test:
cmpl $0, -8056(%rax) #, ti_24->notxn # CMP mem, imm: 4 ; 1
reenable:
decl -8056(%rax) # ti_25->notxn # DEC mem: 4 ; 6
bitfield:
=========
.file 8 "/w/kernel/linux-2.6/arch/x86/include/asm/thread_info.h"
.loc 8 211 0
#APP
# 211 "/w/kernel/linux-2.6/arch/x86/include/asm/thread_info.h" 1
movq %gs:kernel_stack,%rax #, pfo_ret__
# 0 "" 2
.LVL238:
#NO_APP
disable:
xorb $4, -8056(%rax) #, # XOR mem, imm: 1 ; 0
test:
testb $4, -8056(%rax) #, # TEST mem, imm: 4 ; -
reenable:
xorb $4, -8056(%rax) #, # XOR mem, imm: 1 ; 0
So let's explain. The AMD F10h column shows the respective instruction
latencies on AMD F10h. All instructions are DirectPath single.
The SNB column is something similar which I could find for Intel
Sandybridge: http://www.agner.org/optimize/instruction_tables.pdf. I'm
assuming Agner Fog's measurements are more or less accurate.
And wow, the XOR is *actually* faster. That's whopping three cycles on
AMD. Similar observation on SNB.
Now let's look at decoding bandwidth:
unsigned:
=========
disable:
13: ff 80 88 e0 ff ff incl -0x1f78(%rax)
test:
9: 83 b8 88 e0 ff ff 00 cmpl $0x0,-0x1f78(%rax)
reenable:
13: ff 88 88 e0 ff ff decl -0x1f78(%rax)
bitfield:
=========
disable:
13: 80 b0 88 e0 ff ff 04 xorb $0x4,-0x1f78(%rax)
test:
9: f6 80 88 e0 ff ff 04 testb $0x4,-0x1f78(%rax)
reenable:
13: 80 b0 88 e0 ff ff 04 xorb $0x4,-0x1f78(%rax)
This particular XOR encoding is 1 byte longer, the rest is on-par.
Oh, and compiler is gcc (Debian 4.7.2-5) 4.7.2.
So you were saying?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Sat, Mar 23, 2013 at 04:52:56PM +0100, Borislav Petkov wrote:
> And wow, the XOR is *actually* faster. That's whopping three cycles on
> AMD. Similar observation on SNB.
Ok, correction: I was looking at the wrong row in the table. The XOR
version with an immediate and mem access is actually as fast as the
INC/DEC mem versions: 4 cycles on AMD F10h and 6 on SNB, according to
the docs. I should've realized that there's still a mem access there...
Anyway, the rest stands.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Fri, Mar 22, 2013 at 6:24 PM, Andi Kleen <[email protected]> wrote:
>
> Some questions and answers:
>
> - How much does it improve performance?
> I cannot share any performance numbers at this point unfortunately.
> Also please keep in mind that the tuning is very preliminary and
> will be revised.
Quite frankly, since the *only* reason for RTM is performance, this
fundamentally makes the patch-set pointless.
If we don't know how much it helps, we can't judge whether it's worth
even discussing this patch. It adds enough complexity that it had
better be worth it, and without knowing the performance side, all we
can see are the negatives.
Talk to your managers about this. Tell them that without performance
numbers, any patch-series like this is totally pointless.
Does it make non-contended code slower? We don't know. Does it improve
anything but micro-benchmarks? We don't know. Is there any point to
this? WE DON'T KNOW.
Inside of intel, it might be useful for testing and validating the
hardware. Outside of intel, it is totally useless without performance
numbers.
The other comment I have is that since it does touch non-x86 header
files etc (although not a lot), you really need to talk to the POWER8
people about naming of the thing. Calling it <linux/rtm.h> and having
"generic" helpers called _xtest() used by the generic spinlock code
sounds a bit suspect.
Linus
On Sat, Mar 23, 2013 at 8:52 AM, Borislav Petkov <[email protected]> wrote:
>
> Really? Let's see:
Your test seems to assume that a single bit is sufficient, which
sounds unlikely. If you have any kind of nesting going on, you need
more than one bit. Then add/sub isn't a single "xor" of a bit any
more.
That said, making it just a byte sounds more than enough. How deep
would anybody want to nest it? Do we care?
Linus
On Sat, Mar 23, 2013 at 10:16:28AM -0700, Linus Torvalds wrote:
> On Sat, Mar 23, 2013 at 8:52 AM, Borislav Petkov <[email protected]> wrote:
> >
> > Really? Let's see:
>
> Your test seems to assume that a single bit is sufficient, which
> sounds unlikely. If you have any kind of nesting going on, you need
> more than one bit. Then add/sub isn't a single "xor" of a bit any
> more.
Right, but does nesting even apply here? I mean, AFAICR the commit
message, we're talking about per-thread transaction disabling. IOW, this
is an on/off button used in a boolean context.
Or is this going to be used as a mechanism to disable TSX when max
nesting level has been reached?
Hmmm.
Btw, the disable/reenable_txn() aren't used anywhere in the patchset and
I'd actually like to get an idea where/how those are used.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
Hi Linux,
Thanks. Other code/design review would be still appreciated, even
under the current constraints.
> The other comment I have is that since it does touch non-x86 header
> files etc (although not a lot), you really need to talk to the POWER8
> people about naming of the thing. Calling it <linux/rtm.h> and having
> "generic" helpers called _xtest() used by the generic spinlock code
> sounds a bit suspect.
I can make up another name for _xtest()/_xabort() and linux/rtm.h,
(any suggestions?)
The basic concepts implemented there should be pretty universal.
If others have an equivalent of "is this a transaction" and "abort
this transaction" they can just plug it in. Otherwise they will nop it,
as these are only hints anyway.
The only things used outside x86 code is _xtest()/_xabort(), can
remove the rest from linux/*. Without transactions this is all nops.
The primary interface for the lock code is the much higher level
elide()/elide_lock_adapt() interface anyways.
-Andi
--
[email protected] -- Speaking for myself only.
> That said, making it just a byte sounds more than enough. How deep
> would anybody want to nest it? Do we care?
Byte should be fine. I'll change that.
-Andi
--
[email protected] -- Speaking for myself only.
On Sat, Mar 23, 2013 at 07:00:10PM +0100, Andi Kleen wrote:
>
> Hi Linux,
Also I finally made my debut on that famous typo too. Sorry.
-Andi
On Sat, 2013-03-23 at 19:00 +0100, Andi Kleen wrote:
> Hi Linux,
>
> Thanks. Other code/design review would be still appreciated, even
> under the current constraints.
>
> > The other comment I have is that since it does touch non-x86 header
> > files etc (although not a lot), you really need to talk to the POWER8
> > people about naming of the thing. Calling it <linux/rtm.h> and having
> > "generic" helpers called _xtest() used by the generic spinlock code
> > sounds a bit suspect.
>
> I can make up another name for _xtest()/_xabort() and linux/rtm.h,
> (any suggestions?)
>
> The basic concepts implemented there should be pretty universal.
> If others have a equivalent of "is this a transaction" and "abort
> this tranction" they can just plug it in. Otherwise they will nop it,
> as it's only hints anyways.
>
> The only things used outside x86 code is _xtest()/_xabort(), can
> remove the rest from linux/*. Without transactions this is all nops.
> The primary interface for the lock code is the much higher level
> elide()/elide_lock_adapt() interface anyways.
Adding Michael Neuling to the CC list, he's probably the LTC person who
is the most familiar with POWER8 TM at the moment.
Cheers,
Ben.
> -Andi
>
> On Sat, 2013-03-23 at 19:00 +0100, Andi Kleen wrote:
> > Hi Linux,
> >
> > Thanks. Other code/design review would be still appreciated, even
> > under the current constraints.
> >
> > > The other comment I have is that since it does touch non-x86 header
> > > files etc (although not a lot), you really need to talk to the POWER8
> > > people about naming of the thing. Calling it <linux/rtm.h> and having
> > > "generic" helpers called _xtest() used by the generic spinlock code
> > > sounds a bit suspect.
> >
> > I can make up another name for _xtest()/_xabort() and linux/rtm.h,
> > (any suggestions?)
> >
> > The basic concepts implemented there should be pretty universal.
> > If others have a equivalent of "is this a transaction" and "abort
> > this tranction" they can just plug it in. Otherwise they will nop it,
> > as it's only hints anyways.
> >
> > The only things used outside x86 code is _xtest()/_xabort(), can
> > remove the rest from linux/*. Without transactions this is all nops.
> > The primary interface for the lock code is the much higher level
> > elide()/elide_lock_adapt() interface anyways.
>
> Adding Michael Neuling to the CC list, he's probably the LTC person who
> is the most familiar with POWER8 TM at the moment.
Thanks. I'll respond inline, but agree, the naming convention is very
x86 centric and will not suit powerpc at all.
Also, like Andy, we don't have permission to post any performance
numbers so have held off on bothering with any thing like this for now.
Mikey
>
> Cheers,
> Ben.
>
> > -Andi
> >
>
>
>
> From: Andi Kleen <[email protected]>
>
> Add generic noop macros (act like transaction aborted) for RTM.
> The main use case is an occasional _xtest() added to generic
> code, without needing ifdefs. On x86+RTM this will use
> real TSX instructions.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> include/linux/rtm.h | 15 +++++++++++++++
> 1 files changed, 15 insertions(+), 0 deletions(-)
> create mode 100644 include/linux/rtm.h
>
> diff --git a/include/linux/rtm.h b/include/linux/rtm.h
RTM == restricted transactional memory. I don't understand why it's
"restricted", why any other architecture would call it that, and
hence why Linux should call it that.
Can we just call it TM for transactional memory? Each arch can then
have their own implementation and call it what they want.
> new file mode 100644
> index 0000000..887b221
> --- /dev/null
> +++ b/include/linux/rtm.h
> @@ -0,0 +1,15 @@
> +#ifndef _LINUX_RTM
> +#define _LINUX_RTM 1
> +
> +#ifdef CONFIG_RTM_LOCKS
> +#include <asm/rtm.h>
> +#else
> +/* Make transactions appear as always abort */
> +#define _XBEGIN_STARTED 0
> +#define _xbegin() 1
> +#define _xtest() 0
> +#define _xend() do {} while (0)
> +#define _xabort(x) do {} while (0)
Similarly.. these are clearly x86 centric. Can we just call them
tmbegin, tmtest, tm... etc?
Mikey
> +#endif
> +
> +#endif
> --
> 1.7.7.6
>
>
> From: Andi Kleen <[email protected]>
>
> Writing _xbegin, which is like setjmp, in an if is very natural.
> Stop checkpatch's whining about this.
This patch should go in before the RTM tester.
>
> Cc: [email protected]
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> scripts/checkpatch.pl | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index b28cc38..659e683 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -2831,7 +2831,10 @@ sub process {
> $line =~ /\b(?:if|while|for)\s*\(/ && $line !~ /^.\s*#/) {
> my ($s, $c) = ($stat, $cond);
>
> - if ($c =~ /\bif\s*\(.*[^<>!=]=[^=].*/s) {
> + # if ((status = _xbegin()) == _XBEGIN_STARTED) is natural,
> + # so don't warn about this case.
> + if ($c =~ /\bif\s*\(.*[^<>!=]=[^=].*/s &&
> + $c !~ /_xbegin/) {
> ERROR("ASSIGN_IN_IF",
> "do not use assignment in if condition\n" . $herecurr);
> }
> --
> 1.7.7.6
>
>
> From: Andi Kleen <[email protected]>
>
> This adds the basic RTM (Restricted Transactional Memory)
> intrinsics for TSX, implemented with alternative() so that they can be
> transparently used without checking CPUID first.
>
> When the CPU does not support TSX we just always jump to the abort handler.
>
> These intrinsics are only expected to be used by some low level code
> that presents higher level interface (like locks).
>
> This is using the same interface as gcc and icc. There's a way to implement
> the intrinsics more efficiently with newer compilers that support asm goto,
> but to avoid undue dependencies on new tool chains this is not used here.
>
> Also the current way looks slightly nicer, at the cost of only two more
> instructions.
>
> Also don't require a TSX aware assembler -- all new instructions are implemented
> with .byte.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> arch/x86/include/asm/rtm.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 82 insertions(+), 0 deletions(-)
> create mode 100644 arch/x86/include/asm/rtm.h
>
> diff --git a/arch/x86/include/asm/rtm.h b/arch/x86/include/asm/rtm.h
> new file mode 100644
> index 0000000..7075a04
> --- /dev/null
> +++ b/arch/x86/include/asm/rtm.h
> @@ -0,0 +1,82 @@
> +#ifndef _RTM_OFFICIAL_H
> +#define _RTM_OFFICIAL_H 1
> +
> +#include <linux/compiler.h>
> +#include <asm/alternative.h>
> +#include <asm/cpufeature.h>
> +#include <asm/nops.h>
> +
> +/*
> + * RTM -- restricted transactional memory ISA
> + *
> + * Official RTM intrinsics interface matching gcc/icc, but works
> + * on older gcc compatible compilers and binutils.
> + *
> + * _xbegin() starts a transaction. When it returns a value different
> + * from _XBEGIN_STARTED a non transactional fallback path
> + * should be executed.
> + *
> + * This is a special kernel variant that supports binary patching.
> + * When the CPU does not support RTM we always jump to the abort handler.
> + * And _xtest() always returns 0.
> +
> + * This means these intrinsics can be used without checking cpu_has_rtm
> + * first.
> + *
> + * This is the low level interface mapping directly to the instructions.
> + * Usually kernel code will use a higher level abstraction instead (like locks)
> + *
> + * Note this can be implemented more efficiently on compilers that support
> + * "asm goto". But we don't want to require this right now.
> + */
> +
> +#define _XBEGIN_STARTED (~0u)
> +#define _XABORT_EXPLICIT (1 << 0)
> +#define _XABORT_RETRY (1 << 1)
> +#define _XABORT_CONFLICT (1 << 2)
> +#define _XABORT_CAPACITY (1 << 3)
> +#define _XABORT_DEBUG (1 << 4)
> +#define _XABORT_NESTED (1 << 5)
> +#define _XABORT_CODE(x) (((x) >> 24) & 0xff)
> +
> +#define _XABORT_SOFTWARE -5 /* not part of ISA */
> +
> +static __always_inline int _xbegin(void)
> +{
> + int ret;
> + alternative_io("mov %[fallback],%[ret] ; " ASM_NOP6,
> + "mov %[started],%[ret] ; "
> + ".byte 0xc7,0xf8 ; .long 0 # XBEGIN 0",
> + X86_FEATURE_RTM,
> + [ret] "=a" (ret),
> + [fallback] "i" (_XABORT_SOFTWARE),
> + [started] "i" (_XBEGIN_STARTED) : "memory");
> + return ret;
> +}
So ppc can do something like this. Stealing from
Documentation/powerpc/transactional_memory.txt, ppc transactions look
like this:
tbegin
beq abort_handler
ld r4, SAVINGS_ACCT(r3)
ld r5, CURRENT_ACCT(r3)
subi r5, r5, 1
addi r4, r4, 1
std r4, SAVINGS_ACCT(r3)
std r5, CURRENT_ACCT(r3)
tend
b continue
abort_handler:
... test for odd failures ...
/* Retry the transaction if it failed because it conflicted with
* someone else: */
b begin_move_money
The abort handler can then see the failure reason via an SPR/status
register TEXASR. There are bits in there to specify failure modes like:
- software failure code (set in the kernel/hypervisor. see
arch/powerpc/include/asm/reg.h)
#define TM_CAUSE_RESCHED 0xfe
#define TM_CAUSE_TLBI 0xfc
#define TM_CAUSE_FAC_UNAV 0xfa
#define TM_CAUSE_SYSCALL 0xf9 /* Persistent */
#define TM_CAUSE_MISC 0xf6
#define TM_CAUSE_SIGNAL 0xf4
- Failure persistent
- Disallowed (like disallowed instruction)
- Nested overflow
- footprint overflow
- self induced conflict
- non-transaction conflict
- transaction conflict
- instruction fetch conflict
- tabort instruction
- failure while transaction was suspended
Some of these overlap with the x86 but I think the fidelity could be
improved.
FYI the TM spec can be downloaded here:
https://www.power.org/documentation/power-isa-transactional-memory/
Your example code looks like this:
static __init int rtm_test(void)
{
unsigned status;
pr_info("simple rtm test\n");
if ((status = _xbegin()) == _XBEGIN_STARTED) {
x++;
_xend();
pr_info("transaction committed\n");
} else {
pr_info("transaction aborted %x\n", status);
}
return 0;
}
Firstly, I think we can do something like this with the ppc mnemonics,
so I think the overall idea is ok with me.
Secondly, can we make xbegin just return true/false and get the status
later if needed?
Something like (changing the 'x' names too)
if (tmbegin()){
x++;
tmend();
pr_info("transaction committed\n");
} else {
pr_info("transaction aborted %x\n", tmstatus());
}
return 0;
Looks cleaner to me.
> +
> +static __always_inline void _xend(void)
> +{
> + /* Not patched because these should be not executed in fallback */
> + asm volatile(".byte 0x0f,0x01,0xd5 # XEND" : : : "memory");
> +}
> +
ppc == tend... should be fine, other than the name.
> +static __always_inline void _xabort(const unsigned int status)
> +{
> + alternative_input(ASM_NOP3,
> + ".byte 0xc6,0xf8,%P0 # XABORT",
> + X86_FEATURE_RTM,
> + "i" (status) : "memory");
> +}
> +
ppc == tabort... should be fine, other than the name.
> +static __always_inline int _xtest(void)
> +{
> + unsigned char out;
> + alternative_io("xor %0,%0 ; " ASM_NOP5,
> + ".byte 0x0f,0x01,0xd6 ; setnz %0 # XTEST",
> + X86_FEATURE_RTM,
> + "=r" (out),
> + "i" (0) : "memory");
> + return out;
> +}
> +
> +#endif
ppc = tcheck... should be fine, other than the name.
Mikey
> FYI the TM spec can be downloaded here:
> https://www.power.org/documentation/power-isa-transactional-memory/
>
> Your example code looks like this:
I don't think portable code will use this directly. Note it's in arch/x86/
Generally portable code should use higher level interfaces, like
elide_lock/elide_lock_adapt that hide the architecture specific
details.
If you want to do lock elision you would plug in some elision
algorithm that works well at that level.
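To make that concrete, here is a rough sketch of the shape such an elision
helper could take. This is my illustration built from the primitives quoted
above, not the patchkit's actual elide_lock()/elide_lock_adapt() code; the
retry and adaptation logic is omitted entirely, and the usual arch_spinlock
helpers are assumed:

/*
 * Illustration only: try to run the critical section as a transaction.
 * The lock word is read inside the transaction, so a real acquisition
 * by another CPU aborts us and we fall back.
 */
static inline bool try_elide_lock(arch_spinlock_t *lock)
{
        if (_xbegin() == _XBEGIN_STARTED) {
                if (!arch_spin_is_locked(lock))
                        return true;    /* elided; lock stays free */
                _xabort(0xff);          /* lock busy: abort and fall back */
        }
        return false;                   /* caller takes the lock for real */
}

/* On unlock, _xtest() tells an elided section from a real one: */
static inline void elide_unlock(arch_spinlock_t *lock)
{
        if (_xtest())
                _xend();                /* commit the transaction */
        else
                arch_spin_unlock(lock);
}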
> Secondly, can we make xbegin just return true/false and get the status
> later if needed?
I now removed xbegin() from the portable file, as it's only used
in arch specific code. And FWIW I'm considering changing it to save
a few instructions and go for the more efficient goto-based
interface used in glibc.
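For reference, a sketch of what that goto-based variant could look like.
This is my assumption of the encoding trick, not posted code: it needs a
compiler with asm goto, it leaves out the alternatives patching (so it
assumes RTM is present), and reading the abort status out of %eax on the
fallback path is not shown:

static inline void elide_or_lock(arch_spinlock_t *lock)
{
        /*
         * XBEGIN is 0xc7 0xf8 followed by a rel32 displacement to the
         * fallback address, measured from the end of the instruction
         * (the "1:" label), so the fallback can be a C label and the
         * success path needs no compare on %eax.
         */
        asm volatile goto(".byte 0xc7,0xf8 ; .long %l[fallback]-1f ; 1:"
                          : : : "eax", "memory" : fallback);
        if (!arch_spin_is_locked(lock))
                return;                 /* running elided */
        _xabort(0xff);                  /* lock already held: give up */
fallback:
        arch_spin_lock(lock);           /* abort lands here */
}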
>
> ppc = tcheck... should be fine, other than the name.
Well, x vs. tm doesn't really matter, but I already have x* so I'm inclined
to keep it, unless people bikeshed too strongly. It should work for PPC too.
BTW if the percpu include loop hell is ever sorted out _xtest may
even stop using XTEST.
-Andi
--
[email protected] -- Speaking for myself only
> RTM == restricted transactional memory. I don't understand why it's
> "restricted" and why any other architecture would call it that and
It's restricted in the sense that there is no guarantee a transaction will
ever succeed, so it always needs a fallback path. My understanding is that
this is true for PPC too, but not necessarily for zSeries.
> hence why Linux should call it that?
>
> Can we just call it TM for transactional memory? Each arch can then
> have their own implementation and call it what they want.
I don't think the name will stop them from doing anything.
If people feel strongly about it I can change it, but it doesn't
actually make a real difference.
-Andi
--
[email protected] -- Speaking for myself only
Andi Kleen <[email protected]> wrote:
> > RTM == restricted transactional memory. I don't understand why it's
> > "restricted" and why any other architecture would call it that and
>
> It's restricted in the sense that there is no guarantee a transaction will
> ever succeed, so it always needs a fallback path. My understanding is that
> this is true for PPC too, but not necessarily for zSeries.
Yes, ppc is limited in this way too. We just didn't call it a silly
name, although we do have the TEXASR register for TM status to make up
for it. :-)
Mikey
>
> > hence why Linux should call it that?
> >
> > Can we just call it TM for transactional memory? Each arch can then
> > have their own implementation and call it what they want.
>
> I don't think the name will stop them from doing anything.
>
> If people feel strongly about it I can change it, but it doesn't
> actually make a real difference.
>
> -Andi
> --
> [email protected] -- Speaking for myself only
>
Andi Kleen <[email protected]> wrote:
> > FYI the TM spec can be downloaded here:
> > https://www.power.org/documentation/power-isa-transactional-memory/
> >
> > Your example code looks like this:
>
> I don't think portable code will use this directly. Note it's in arch/x86/
>
> Generally portable code should use higher level interfaces, like
> elide_lock/elide_lock_adapt that hide the architecture specific
> details.
>
> If you want to do lock elision you would plug in some elision
> algorithm that works well at that level.
>
> > Secondly, can we make xbegin just return true/false and get the status
> > later if needed?
>
> I now removed xbegin() from the portable file, as it's only used
> in arch specific code.
OK, well most of my objections go away now.
> And FWIW I'm considering changing it to save a few instructions and
> go for the more efficient goto-based interface used in glibc.
>
> >
> > ppc = tcheck... should be fine, other than the name.
>
> Well, x vs. tm doesn't really matter, but I already have x* so I'm inclined
> to keep it, unless people bikeshed too strongly. It should work for PPC too.
Well if you're moving it out of generic code then it doesn't really
matter anymore.
Mikey
>
> BTW if the percpu include loop hell is ever sorted out _xtest may
> even stop using XTEST.
> > Well, x vs. tm doesn't really matter, but I already have x* so I'm inclined
> > to keep it, unless people bikeshed too strongly. It should work for PPC too.
>
> Well if you're moving it out of generic code then it doesn't really
> matter anymore.
The only operations I want to be generic are _xtest() and _xabort().
Neither will be very widely used, but they are useful for various
tunings.
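As a made-up example of that kind of use (names invented here for
illustration, not from the patchkit):

/*
 * Illustration only: keep a transaction's memory footprint small by
 * skipping optional bookkeeping while elided, and explicitly abort
 * when reaching code that cannot run transactionally anyway.
 */
static inline void count_slow_path(atomic_long_t *counter)
{
        if (_xtest())   /* inside a transaction: don't dirty a shared line */
                return;
        atomic_long_inc(counter);
}

#define ABORT_NEED_REAL_LOCK    0x01    /* arbitrary software abort code */

static inline void require_real_lock(void)
{
        if (_xtest())
                _xabort(ABORT_NEED_REAL_LOCK);
}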
-Andi
--
[email protected] -- Speaking for myself only.
On Saturday, March 23, 2013 6:11:52 PM UTC+1, Linus Torvalds wrote:
> On Fri, Mar 22, 2013 at 6:24 PM, Andi Kleen <[email protected]> wrote:
>
> >
> > Some questions and answers:
> >
> > - How much does it improve performance?
>
> > I cannot share any performance numbers at this point unfortunately.
> > Also please keep in mind that the tuning is very preliminary and
> > will be revised.
>
> If we don't know how much it helps, we can't judge whether it's worth
> even discussing this patch. It adds enough complexity that it had
> better be worth it, and without knowing the performance side, all we
> can see are the negatives.
>
> Talk to your managers about this. Tell them that without performance
> numbers, any patch-series like this is totally pointless.
Hello,
I don't know if this thread is still active, but I have a Core i7 4770
as my home PC, which supports TSX. I bought it *exactly* to experiment
with hardware transactions.
I am willing to test and benchmark kernel patches, and since I do not
work for Intel I can share all the quantitative performance differences
I find.
Obviously, they will be *my* results, not official Intel ones -
it's up to Andi Kleen or some other Intel person to say whether they
are OK with this, but since CPUs with TSX are now available in shops,
non-disclosure about their performance seems a bit difficult to
enforce...
--
From my preliminary performance results I can say that, at least for
user space, RTM seems really fast. On my PC, the overhead of an empty
transaction is approximately 11 nanoseconds, and a minimal transaction
reading and writing 2 or 3 memory addresses runs in approximately
15-20 nanoseconds.
I just hope I did not violate some non-disclosure condition attached
to the CPU warranty certificate ;-)
I tested it both with GCC, using inline assembler and .byte directives,
and in Lisp (don't tell anybody), by writing a compiler module that
defines the XBEGIN, XTEST, XABORT and XEND primitives.
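For anyone who wants to reproduce this kind of measurement, a minimal
user-space sketch along those lines could look like the following. This is
my own reconstruction of such a micro-benchmark, not the code actually used;
it assumes a TSX-capable CPU, since the raw XBEGIN/XEND bytes are not
feature-checked here, and it reports cycles rather than nanoseconds:

/*
 * rtm_bench.c - rough sketch of a user-space RTM micro-benchmark.
 * The .byte sequences are the raw XBEGIN/XEND encodings and will #UD
 * on CPUs without TSX.  Build: gcc -O2 -o rtm_bench rtm_bench.c
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                  /* __rdtsc() */

#define XBEGIN_STARTED  (~0u)

static inline unsigned int xbegin(void)
{
        unsigned int ret = XBEGIN_STARTED;
        /* rel32 of 0: on abort we fall through with the status in %eax */
        asm volatile(".byte 0xc7,0xf8 ; .long 0" : "+a" (ret) : : "memory");
        return ret;
}

static inline void xend(void)
{
        asm volatile(".byte 0x0f,0x01,0xd5" : : : "memory");
}

int main(void)
{
        volatile long x = 0;
        long i, iters = 10 * 1000 * 1000, aborts = 0;
        uint64_t start = __rdtsc();

        for (i = 0; i < iters; i++) {
                if (xbegin() == XBEGIN_STARTED) {
                        x++;            /* minimal transactional work */
                        xend();
                } else {
                        aborts++;
                }
        }
        printf("%.1f cycles per attempt, %ld aborts\n",
               (double)(__rdtsc() - start) / iters, aborts);
        return 0;
}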
--
How can I help?
I would start with the patches already posted by Andi, but the ones
I found in the LKML archives seem to belong to at least two different
patch sets: xy/31 (September 2012) and xy/29 (March 2013), and I could
not tell whether the first set is a prerequisite for the second.
Regards,
Massimiliano