2007-11-01 14:01:56

by Nick Piggin

Subject: [patch 0/4] ticket spinlocks for x86

Hi,

I'd like to propose these patches for the x86 tree for a bit more exposure
and testing. Or at least get some discussion going again.

Just for fun I also had a shot at merging the headers, as they become a
lot more similar after this with the removal of the paravirt crud.


Nick


2007-11-01 14:02:51

by Nick Piggin

Subject: [patch 1/4] spinlock: lockbreak cleanup


The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock, but it also changes locking
to a potentially less optimal trylock.

Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.

Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).
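
For illustration, the caller-side pattern ends up looking roughly like the
sketch below (hand-waved example only: my_obj and process_one() are made-up
placeholders, the real call sites are in the jbd and mm hunks further down):

static void walk_list_breakable(struct my_obj *obj)
{
	spin_lock(&obj->lock);
	while (!list_empty(&obj->list)) {
		process_one(obj);	/* do a small chunk of work under the lock */

		/* back off if someone else wants the lock, or we should resched */
		if (need_resched() || spin_needbreak(&obj->lock)) {
			spin_unlock(&obj->lock);
			cond_resched();
			spin_lock(&obj->lock);
		}
	}
	spin_unlock(&obj->lock);
}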

Signed-off-by: Nick Piggin <[email protected]>

---
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1854,23 +1854,16 @@ extern int cond_resched_softirq(void);

/*
* Does a critical section need to be broken due to another
- * task waiting?:
+ * task waiting?: (technically does not depend on CONFIG_PREEMPT,
+ * but a general need for low latency)
*/
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)
-# define need_lockbreak(lock) ((lock)->break_lock)
-#else
-# define need_lockbreak(lock) 0
-#endif
-
-/*
- * Does a critical section need to be broken due to another
- * task waiting or preemption being signalled:
- */
-static inline int lock_need_resched(spinlock_t *lock)
+static inline int spin_needbreak(spinlock_t *lock)
{
- if (need_lockbreak(lock) || need_resched())
- return 1;
+#ifdef CONFIG_PREEMPT
+ return spin_is_contended(lock);
+#else
return 0;
+#endif
}

/*
Index: linux-2.6/include/linux/spinlock.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock.h
+++ linux-2.6/include/linux/spinlock.h
@@ -120,6 +120,12 @@ do { \

#define spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock)

+#ifdef CONFIG_GENERIC_LOCKBREAK
+#define spin_is_contended(lock) ((lock)->break_lock)
+#else
+#define spin_is_contended(lock) __raw_spin_is_contended(&(lock)->raw_lock)
+#endif
+
/**
* spin_unlock_wait - wait until the spinlock gets unlocked
* @lock: the spinlock in question.
Index: linux-2.6/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.orig/fs/jbd/checkpoint.c
+++ linux-2.6/fs/jbd/checkpoint.c
@@ -347,7 +347,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs,&batch_count);
- if (!retry && lock_need_resched(&journal->j_list_lock)){
+ if (!retry && (need_resched() ||
+ spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
retry = 1;
break;
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -265,7 +265,7 @@ write_out_data:
put_bh(bh);
}

- if (lock_need_resched(&journal->j_list_lock)) {
+ if (need_resched() || spin_needbreak(&journal->j_list_lock)) {
spin_unlock(&journal->j_list_lock);
goto write_out_data;
}
Index: linux-2.6/fs/jbd2/checkpoint.c
===================================================================
--- linux-2.6.orig/fs/jbd2/checkpoint.c
+++ linux-2.6/fs/jbd2/checkpoint.c
@@ -347,7 +347,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs,&batch_count);
- if (!retry && lock_need_resched(&journal->j_list_lock)){
+ if (!retry && (need_resched() ||
+ spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
retry = 1;
break;
Index: linux-2.6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd2/commit.c
+++ linux-2.6/fs/jbd2/commit.c
@@ -265,7 +265,7 @@ write_out_data:
put_bh(bh);
}

- if (lock_need_resched(&journal->j_list_lock)) {
+ if (need_resched() || spin_needbreak(&journal->j_list_lock)) {
spin_unlock(&journal->j_list_lock);
goto write_out_data;
}
Index: linux-2.6/include/linux/spinlock_up.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock_up.h
+++ linux-2.6/include/linux/spinlock_up.h
@@ -64,6 +64,8 @@ static inline void __raw_spin_unlock(raw
# define __raw_spin_trylock(lock) ({ (void)(lock); 1; })
#endif /* DEBUG_SPINLOCK */

+#define __raw_spin_is_contended(lock) (((void)(lock), 0))
+
#define __raw_read_can_lock(lock) (((void)(lock), 1))
#define __raw_write_can_lock(lock) (((void)(lock), 1))

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4662,19 +4662,15 @@ EXPORT_SYMBOL(cond_resched);
*/
int cond_resched_lock(spinlock_t *lock)
{
+ int resched = need_resched() && system_state == SYSTEM_RUNNING;
int ret = 0;

- if (need_lockbreak(lock)) {
+ if (spin_needbreak(lock) || resched) {
spin_unlock(lock);
- cpu_relax();
- ret = 1;
- spin_lock(lock);
- }
- if (need_resched() && system_state == SYSTEM_RUNNING) {
- spin_release(&lock->dep_map, 1, _THIS_IP_);
- _raw_spin_unlock(lock);
- preempt_enable_no_resched();
- __cond_resched();
+ if (resched && need_resched())
+ __cond_resched();
+ else
+ cpu_relax();
ret = 1;
spin_lock(lock);
}
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -511,8 +511,7 @@ again:
if (progress >= 32) {
progress = 0;
if (need_resched() ||
- need_lockbreak(src_ptl) ||
- need_lockbreak(dst_ptl))
+ spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
break;
}
if (pte_none(*src_pte)) {
@@ -851,7 +850,7 @@ unsigned long unmap_vmas(struct mmu_gath
tlb_finish_mmu(*tlbp, tlb_start, start);

if (need_resched() ||
- (i_mmap_lock && need_lockbreak(i_mmap_lock))) {
+ (i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
*tlbp = NULL;
goto out;
@@ -1763,8 +1762,7 @@ again:

restart_addr = zap_page_range(vma, start_addr,
end_addr - start_addr, details);
- need_break = need_resched() ||
- need_lockbreak(details->i_mmap_lock);
+ need_break = need_resched() || spin_needbreak(details->i_mmap_lock);

if (restart_addr >= end_addr) {
/* We have now completed this vma: mark it so */
Index: linux-2.6/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86_64/Kconfig
+++ linux-2.6/arch/x86_64/Kconfig
@@ -78,6 +78,11 @@ config ISA
config SBUS
bool

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_GENERIC_SPINLOCK
bool
default y
Index: linux-2.6/include/linux/spinlock_types.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock_types.h
+++ linux-2.6/include/linux/spinlock_types.h
@@ -19,7 +19,7 @@

typedef struct {
raw_spinlock_t raw_lock;
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)
+#ifdef CONFIG_GENERIC_LOCKBREAK
unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
@@ -35,7 +35,7 @@ typedef struct {

typedef struct {
raw_rwlock_t raw_lock;
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)
+#ifdef CONFIG_GENERIC_LOCKBREAK
unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c
+++ linux-2.6/kernel/spinlock.c
@@ -65,8 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
* even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
* not re-enabled during lock-acquire (which the preempt-spin-ops do):
*/
-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
- defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)

void __lockfunc _read_lock(rwlock_t *lock)
{
Index: linux-2.6/arch/arm/Kconfig
===================================================================
--- linux-2.6.orig/arch/arm/Kconfig
+++ linux-2.6/arch/arm/Kconfig
@@ -91,6 +91,11 @@ config GENERIC_IRQ_PROBE
bool
default y

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_GENERIC_SPINLOCK
bool
default y
Index: linux-2.6/arch/i386/Kconfig
===================================================================
--- linux-2.6.orig/arch/i386/Kconfig
+++ linux-2.6/arch/i386/Kconfig
@@ -14,6 +14,11 @@ config X86_32
486, 586, Pentiums, and various instruction-set-compatible chips by
AMD, Cyrix, and others.

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config GENERIC_TIME
bool
default y
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig
+++ linux-2.6/arch/ia64/Kconfig
@@ -42,6 +42,11 @@ config MMU
config SWIOTLB
bool

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_XCHGADD_ALGORITHM
bool
default y
Index: linux-2.6/arch/m32r/Kconfig
===================================================================
--- linux-2.6.orig/arch/m32r/Kconfig
+++ linux-2.6/arch/m32r/Kconfig
@@ -235,6 +235,11 @@ config IRAM_SIZE
# Define implied options from the CPU selection here
#

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_GENERIC_SPINLOCK
bool
depends on M32R
Index: linux-2.6/arch/mips/Kconfig
===================================================================
--- linux-2.6.orig/arch/mips/Kconfig
+++ linux-2.6/arch/mips/Kconfig
@@ -671,6 +671,11 @@ source "arch/mips/vr41xx/Kconfig"

endmenu

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_GENERIC_SPINLOCK
bool
default y
Index: linux-2.6/arch/parisc/Kconfig
===================================================================
--- linux-2.6.orig/arch/parisc/Kconfig
+++ linux-2.6/arch/parisc/Kconfig
@@ -19,6 +19,11 @@ config MMU
config STACK_GROWSUP
def_bool y

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_GENERIC_SPINLOCK
def_bool y

Index: linux-2.6/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.orig/arch/sparc64/Kconfig
+++ linux-2.6/arch/sparc64/Kconfig
@@ -200,6 +200,11 @@ config US2E_FREQ
If in doubt, say N.

# Global things across all Sun machines.
+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config RWSEM_GENERIC_SPINLOCK
bool

2007-11-01 14:03:32

by Nick Piggin

Subject: [patch 1/4] x86: FIFO ticket spinlocks


Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out of
cache, and becomes almost insignificant even on P4 when the lock misses cache.
trylock is more significantly slower, but they are relatively rare.

On an 8-core (2-socket) Opteron, spinlock unfairness is extremely noticeable,
with a userspace test showing a difference of up to 2x runtime per thread, and
some threads being starved or "unfairly" granted the lock up to 1 000 000 (!)
times. After this patch, all threads appear to finish at exactly the same
time.

The memory ordering of the lock does conform to x86 standards, and the
implementation has been reviewed by Intel and AMD engineers.

The algorithm also tells us how many CPUs are contending the lock, so
lockbreak becomes trivial and we no longer have to waste 4 bytes per
spinlock for it.

After this, we can no longer spin on any locks with preempt enabled
and cannot re-enable interrupts when spinning on an irq-safe lock, because
at that point we have already taken a ticket and we would deadlock if
the same CPU tries to take the lock again. These are questionable anyway:
if the lock happens to be taken inside a preempt- or interrupt-disabled section,
then it will just have the same latency problems. The real fix is to keep
critical sections short, and ensure locks are reasonably fair (which this
patch does).
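
For reference, the algorithm in plain C pseudocode looks something like the
following (fetch_and_inc() is just a stand-in for the atomic xaddw, and the
16-bit byte-packing details described in the code comments are glossed over):

struct ticket_lock {
	unsigned char head;	/* ticket currently being served */
	unsigned char tail;	/* next ticket to be handed out */
};

static void ticket_lock(struct ticket_lock *lock)
{
	/* atomically take a ticket (this is what the xaddw does) */
	unsigned char me = fetch_and_inc(&lock->tail);

	while (*(volatile unsigned char *)&lock->head != me)
		cpu_relax();			/* spin until it is our turn */
}

static void ticket_unlock(struct ticket_lock *lock)
{
	lock->head++;				/* pass the lock on (the incb) */
}

static int ticket_is_contended(struct ticket_lock *lock)
{
	/* more than one ticket outstanding means someone is queued behind the owner */
	return (unsigned char)(lock->tail - lock->head) > 1;
}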

Signed-off-by: Nick Piggin <[email protected]>

---
Index: linux-2.6/include/asm-x86/spinlock_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_64.h
+++ linux-2.6/include/asm-x86/spinlock_64.h
@@ -12,74 +12,98 @@
* Simple spin lock operations. There are two variants, one clears IRQ's
* on the local processor, one does not.
*
- * We make no fairness assumptions. They have a cost.
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
*
* (the type definitions are in asm/spinlock_types.h)
*/

+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
{
- return *(volatile signed int *)(&(lock)->slock) <= 0;
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) != (tmp & 0xff));
}

-static inline void __raw_spin_lock(raw_spinlock_t *lock)
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
{
- asm volatile(
- "\n1:\t"
- LOCK_PREFIX " ; decl %0\n\t"
- "jns 2f\n"
- "3:\n"
- "rep;nop\n\t"
- "cmpl $0,%0\n\t"
- "jle 3b\n\t"
- "jmp 1b\n"
- "2:\t" : "=m" (lock->slock) : : "memory");
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
}

-/*
- * Same as __raw_spin_lock, but reenable interrupts during spinning.
- */
-#ifndef CONFIG_PROVE_LOCKING
-static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
- asm volatile(
- "\n1:\t"
- LOCK_PREFIX " ; decl %0\n\t"
- "jns 5f\n"
- "testl $0x200, %1\n\t" /* interrupts were disabled? */
- "jz 4f\n\t"
- "sti\n"
- "3:\t"
- "rep;nop\n\t"
- "cmpl $0, %0\n\t"
- "jle 3b\n\t"
- "cli\n\t"
+ short inc = 0x0100;
+
+ /*
+ * Ticket locks are conceptually two bytes, one indicating the current
+ * head of the queue, and the other indicating the current tail. The
+ * lock is acquired by atomically noting the tail and incrementing it
+ * by one (thus adding ourself to the queue and noting our position),
+ * then waiting until the head becomes equal to the initial value
+ * of the tail.
+ *
+ * This uses a 16-bit xadd to increment the tail and also load the
+ * position of the head, which takes care of memory ordering issues
+ * and should be optimal for the uncontended case. Note the tail must
+ * be in the high byte, otherwise the 16-bit wide increment of the low
+ * byte would carry up and contaminate the high byte.
+ */
+
+ __asm__ __volatile__ (
+ LOCK_PREFIX "xaddw %w0, %1\n"
+ "1:\t"
+ "cmpb %h0, %b0\n\t"
+ "je 2f\n\t"
+ "rep ; nop\n\t"
+ "movb %1, %b0\n\t"
+ /* don't need lfence here, because loads are in-order */
"jmp 1b\n"
- "4:\t"
- "rep;nop\n\t"
- "cmpl $0, %0\n\t"
- "jg 1b\n\t"
- "jmp 4b\n"
- "5:\n\t"
- : "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+ "2:"
+ :"+Q" (inc), "+m" (lock->slock)
+ :
+ :"memory", "cc");
}
-#endif
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)

static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
- int oldval;
+ short prev;
+ short new;
+ int ret = 0;

asm volatile(
- "xchgl %0,%1"
- :"=q" (oldval), "=m" (lock->slock)
- :"0" (0) : "memory");
+ "movw %2,%w0\n\t"
+ "cmpb %h0,%b0\n\t"
+ "jne 1f\n\t"
+ "movw %w0,%w1\n\t"
+ "incb %h1\n\t"
+ "lock ; cmpxchgw %w1,%2\n\t"
+ "decb %h1\n\t"
+ "cmpw %w0,%w1\n\t"
+ "jne 1f\n\t"
+ "movl $1,%3\n\t"
+ "1:"
+ :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret)
+ :
+ : "memory", "cc");

- return oldval > 0;
+ return ret;
}

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
- asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
+ __asm__ __volatile__(
+ "incb %0"
+ :"+m" (lock->slock)
+ :
+ :"memory", "cc");
}

static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
Index: linux-2.6/include/asm-x86/spinlock_types.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_types.h
+++ linux-2.6/include/asm-x86/spinlock_types.h
@@ -9,7 +9,7 @@ typedef struct {
unsigned int slock;
} raw_spinlock_t;

-#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED { 0 }

typedef struct {
unsigned int lock;
Index: linux-2.6/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86_64/Kconfig
+++ linux-2.6/arch/x86_64/Kconfig
@@ -80,8 +80,7 @@ config SBUS

config GENERIC_LOCKBREAK
bool
- default y
- depends on SMP && PREEMPT
+ default n

config RWSEM_GENERIC_SPINLOCK
bool
Index: linux-2.6/include/asm-x86/paravirt.h
===================================================================
--- linux-2.6.orig/include/asm-x86/paravirt.h
+++ linux-2.6/include/asm-x86/paravirt.h
@@ -1071,27 +1071,6 @@ static inline unsigned long __raw_local_
return f;
}

-#define CLI_STRING \
- _paravirt_alt("pushl %%ecx; pushl %%edx;" \
- "call *%[paravirt_cli_opptr];" \
- "popl %%edx; popl %%ecx", \
- "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
-
-#define STI_STRING \
- _paravirt_alt("pushl %%ecx; pushl %%edx;" \
- "call *%[paravirt_sti_opptr];" \
- "popl %%edx; popl %%ecx", \
- "%c[paravirt_sti_type]", "%c[paravirt_clobber]")
-
-#define CLI_STI_CLOBBERS , "%eax"
-#define CLI_STI_INPUT_ARGS \
- , \
- [paravirt_cli_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_disable)), \
- [paravirt_cli_opptr] "m" (pv_irq_ops.irq_disable), \
- [paravirt_sti_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_enable)), \
- [paravirt_sti_opptr] "m" (pv_irq_ops.irq_enable), \
- paravirt_clobber(CLBR_EAX)
-
/* Make sure as little as possible of this mess escapes. */
#undef PARAVIRT_CALL
#undef __PVOP_CALL
Index: linux-2.6/include/asm-x86/spinlock_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ linux-2.6/include/asm-x86/spinlock_32.h
@@ -7,120 +7,116 @@
#include <asm/processor.h>
#include <linux/compiler.h>

-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else
-#define CLI_STRING "cli"
-#define STI_STRING "sti"
-#define CLI_STI_CLOBBERS
-#define CLI_STI_INPUT_ARGS
-#endif /* CONFIG_PARAVIRT */
-
/*
* Your basic SMP spinlocks, allowing only a single CPU anywhere
*
* Simple spin lock operations. There are two variants, one clears IRQ's
* on the local processor, one does not.
*
- * We make no fairness assumptions. They have a cost.
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
*
* (the type definitions are in asm/spinlock_types.h)
*/

-static inline int __raw_spin_is_locked(raw_spinlock_t *x)
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
{
- return *(volatile signed char *)(&(x)->slock) <= 0;
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) != (tmp & 0xff));
}

-static inline void __raw_spin_lock(raw_spinlock_t *lock)
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
{
- asm volatile("\n1:\t"
- LOCK_PREFIX " ; decb %0\n\t"
- "jns 3f\n"
- "2:\t"
- "rep;nop\n\t"
- "cmpb $0,%0\n\t"
- "jle 2b\n\t"
- "jmp 1b\n"
- "3:\n\t"
- : "+m" (lock->slock) : : "memory");
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
}

-/*
- * It is easier for the lock validator if interrupts are not re-enabled
- * in the middle of a lock-acquire. This is a performance feature anyway
- * so we turn it off:
- *
- * NOTE: there's an irqs-on section here, which normally would have to be
- * irq-traced, but on CONFIG_TRACE_IRQFLAGS we never use this variant.
- */
-#ifndef CONFIG_PROVE_LOCKING
-static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
- asm volatile(
- "\n1:\t"
- LOCK_PREFIX " ; decb %[slock]\n\t"
- "jns 5f\n"
- "2:\t"
- "testl $0x200, %[flags]\n\t"
- "jz 4f\n\t"
- STI_STRING "\n"
- "3:\t"
- "rep;nop\n\t"
- "cmpb $0, %[slock]\n\t"
- "jle 3b\n\t"
- CLI_STRING "\n\t"
+ short inc = 0x0100;
+
+ /*
+ * Ticket locks are conceptually two bytes, one indicating the current
+ * head of the queue, and the other indicating the current tail. The
+ * lock is acquired by atomically noting the tail and incrementing it
+ * by one (thus adding ourself to the queue and noting our position),
+ * then waiting until the head becomes equal to the initial value
+ * of the tail.
+ *
+ * This uses a 16-bit xadd to increment the tail and also load the
+ * position of the head, which takes care of memory ordering issues
+ * and should be optimal for the uncontended case. Note the tail must
+ * be in the high byte, otherwise the 16-bit wide increment of the low
+ * byte would carry up and contaminate the high byte.
+ */
+
+ __asm__ __volatile__ (
+ LOCK_PREFIX "xaddw %w0, %1\n"
+ "1:\t"
+ "cmpb %h0, %b0\n\t"
+ "je 2f\n\t"
+ "rep ; nop\n\t"
+ "movb %1, %b0\n\t"
+ /* don't need lfence here, because loads are in-order */
"jmp 1b\n"
- "4:\t"
- "rep;nop\n\t"
- "cmpb $0, %[slock]\n\t"
- "jg 1b\n\t"
- "jmp 4b\n"
- "5:\n\t"
- : [slock] "+m" (lock->slock)
- : [flags] "r" (flags)
- CLI_STI_INPUT_ARGS
- : "memory" CLI_STI_CLOBBERS);
+ "2:"
+ :"+Q" (inc), "+m" (lock->slock)
+ :
+ :"memory", "cc");
}
-#endif
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)

static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
- char oldval;
+ short prev;
+ short new;
+ int ret = 0;
+
asm volatile(
- "xchgb %b0,%1"
- :"=q" (oldval), "+m" (lock->slock)
- :"0" (0) : "memory");
- return oldval > 0;
+ "movw %2,%w0\n\t"
+ "cmpb %h0,%b0\n\t"
+ "jne 1f\n\t"
+ "movw %w0,%w1\n\t"
+ "incb %h1\n\t"
+ "lock ; cmpxchgw %w1,%2\n\t"
+ "decb %h1\n\t"
+ "cmpw %w0,%w1\n\t"
+ "jne 1f\n\t"
+ "movl $1,%3\n\t"
+ "1:"
+ :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret)
+ :
+ : "memory", "cc");
+
+ return ret;
}

+#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
/*
- * __raw_spin_unlock based on writing $1 to the low byte.
- * This method works. Despite all the confusion.
- * (except on PPro SMP or if we are using OOSTORE, so we use xchgb there)
+ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
* (PPro errata 66, 92)
*/
-
-#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
-
-static inline void __raw_spin_unlock(raw_spinlock_t *lock)
-{
- asm volatile("movb $1,%0" : "+m" (lock->slock) :: "memory");
-}
-
+#define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
+#define UNLOCK_LOCK_PREFIX
+#endif

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
- char oldval = 1;
-
- asm volatile("xchgb %b0, %1"
- : "=q" (oldval), "+m" (lock->slock)
- : "0" (oldval) : "memory");
+ __asm__ __volatile__(
+ UNLOCK_LOCK_PREFIX "incb %0"
+ :"+m" (lock->slock)
+ :
+ :"memory", "cc");
}

-#endif
-
static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
{
while (__raw_spin_is_locked(lock))

2007-11-01 14:04:22

by Nick Piggin

Subject: [patch 3/4] x86: spinlock.h merge prep


Prepare for merging the 32-bit and 64-bit spinlocks by making them identical
(except for the OOSTORE thing). raw_read_lock and raw_write_lock get a
relaxed register constraint, and 64-bit has a few "=m" constraints changed
to "+m". I hope these things actually make the code better.

Signed-off-by: Nick Piggin <[email protected]>
---
Index: linux-2.6/include/asm-x86/spinlock_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ linux-2.6/include/asm-x86/spinlock_32.h
@@ -98,14 +98,15 @@ static inline int __raw_spin_trylock(raw
return ret;
}

-#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
+#if defined(CONFIG_X86_32) && \
+ (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
/*
* On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
* (PPro errata 66, 92)
*/
-#define UNLOCK_LOCK_PREFIX LOCK_PREFIX
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
-#define UNLOCK_LOCK_PREFIX
+# define UNLOCK_LOCK_PREFIX
#endif

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
@@ -135,49 +136,42 @@ static inline void __raw_spin_unlock_wai
*
* On x86, we implement read-write locks as a 32-bit counter
* with the high bit (sign) being the "contended" bit.
- *
- * The inline assembly is non-obvious. Think about it.
- *
- * Changed to use the same technique as rw semaphores. See
- * semaphore.h for details. -ben
- *
- * the helpers are in arch/i386/kernel/semaphore.c
*/

/**
* read_can_lock - would read_trylock() succeed?
* @lock: the rwlock in question.
*/
-static inline int __raw_read_can_lock(raw_rwlock_t *x)
+static inline int __raw_read_can_lock(raw_rwlock_t *lock)
{
- return (int)(x)->lock > 0;
+ return (int)(lock)->lock > 0;
}

/**
* write_can_lock - would write_trylock() succeed?
* @lock: the rwlock in question.
*/
-static inline int __raw_write_can_lock(raw_rwlock_t *x)
+static inline int __raw_write_can_lock(raw_rwlock_t *lock)
{
- return (x)->lock == RW_LOCK_BIAS;
+ return (lock)->lock == RW_LOCK_BIAS;
}

static inline void __raw_read_lock(raw_rwlock_t *rw)
{
- asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t"
+ asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
"jns 1f\n"
"call __read_lock_failed\n\t"
"1:\n"
- ::"a" (rw) : "memory");
+ ::"r" (rw) : "memory");
}

static inline void __raw_write_lock(raw_rwlock_t *rw)
{
- asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t"
+ asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
"jz 1f\n"
"call __write_lock_failed\n\t"
"1:\n"
- ::"a" (rw) : "memory");
+ ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory");
}

static inline int __raw_read_trylock(raw_rwlock_t *lock)
@@ -206,8 +200,8 @@ static inline void __raw_read_unlock(raw

static inline void __raw_write_unlock(raw_rwlock_t *rw)
{
- asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0"
- : "+m" (rw->lock) : : "memory");
+ asm volatile(LOCK_PREFIX "addl %1,%0"
+ : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
}

#define _raw_spin_relax(lock) cpu_relax()
Index: linux-2.6/include/asm-x86/spinlock_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_64.h
+++ linux-2.6/include/asm-x86/spinlock_64.h
@@ -5,6 +5,7 @@
#include <asm/rwlock.h>
#include <asm/page.h>
#include <asm/processor.h>
+#include <linux/compiler.h>

/*
* Your basic SMP spinlocks, allowing only a single CPU anywhere
@@ -126,11 +127,19 @@ static inline void __raw_spin_unlock_wai
* with the high bit (sign) being the "contended" bit.
*/

+/**
+ * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
static inline int __raw_read_can_lock(raw_rwlock_t *lock)
{
return (int)(lock)->lock > 0;
}

+/**
+ * write_can_lock - would write_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
static inline int __raw_write_can_lock(raw_rwlock_t *lock)
{
return (lock)->lock == RW_LOCK_BIAS;
@@ -140,18 +149,18 @@ static inline void __raw_read_lock(raw_r
{
asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
"jns 1f\n"
- "call __read_lock_failed\n"
+ "call __read_lock_failed\n\t"
"1:\n"
- ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory");
+ ::"r" (rw) : "memory");
}

static inline void __raw_write_lock(raw_rwlock_t *rw)
{
asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
"jz 1f\n"
- "\tcall __write_lock_failed\n\t"
+ "call __write_lock_failed\n\t"
"1:\n"
- ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory");
+ ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory");
}

static inline int __raw_read_trylock(raw_rwlock_t *lock)
@@ -175,13 +184,13 @@ static inline int __raw_write_trylock(ra

static inline void __raw_read_unlock(raw_rwlock_t *rw)
{
- asm volatile(LOCK_PREFIX " ; incl %0" :"=m" (rw->lock) : : "memory");
+ asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory");
}

static inline void __raw_write_unlock(raw_rwlock_t *rw)
{
- asm volatile(LOCK_PREFIX " ; addl $" RW_LOCK_BIAS_STR ",%0"
- : "=m" (rw->lock) : : "memory");
+ asm volatile(LOCK_PREFIX "addl %1,%0"
+ : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
}

#define _raw_spin_relax(lock) cpu_relax()

2007-11-01 14:05:19

by Nick Piggin

Subject: [patch 4/4] x86: spinlock.h merge


Merge spinlock_32.h and spinlock_64.h into spinlock.h.

Signed-off-by: Nick Piggin <[email protected]>
---
Index: linux-2.6/include/asm-x86/spinlock.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock.h
+++ linux-2.6/include/asm-x86/spinlock.h
@@ -1,5 +1,211 @@
-#ifdef CONFIG_X86_32
-# include "spinlock_32.h"
+#ifndef __ASM_SPINLOCK_H
+#define __ASM_SPINLOCK_H
+
+#include <asm/atomic.h>
+#include <asm/rwlock.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+#include <linux/compiler.h>
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations. There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
+{
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) != (tmp & 0xff));
+}
+
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
+{
+ int tmp = *(volatile signed int *)(&(lock)->slock);
+
+ return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
+}
+
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
+{
+ short inc = 0x0100;
+
+ /*
+ * Ticket locks are conceptually two bytes, one indicating the current
+ * head of the queue, and the other indicating the current tail. The
+ * lock is acquired by atomically noting the tail and incrementing it
+ * by one (thus adding ourself to the queue and noting our position),
+ * then waiting until the head becomes equal to the initial value
+ * of the tail.
+ *
+ * This uses a 16-bit xadd to increment the tail and also load the
+ * position of the head, which takes care of memory ordering issues
+ * and should be optimal for the uncontended case. Note the tail must
+ * be in the high byte, otherwise the 16-bit wide increment of the low
+ * byte would carry up and contaminate the high byte.
+ */
+
+ __asm__ __volatile__ (
+ LOCK_PREFIX "xaddw %w0, %1\n"
+ "1:\t"
+ "cmpb %h0, %b0\n\t"
+ "je 2f\n\t"
+ "rep ; nop\n\t"
+ "movb %1, %b0\n\t"
+ /* don't need lfence here, because loads are in-order */
+ "jmp 1b\n"
+ "2:"
+ :"+Q" (inc), "+m" (lock->slock)
+ :
+ :"memory", "cc");
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int __raw_spin_trylock(raw_spinlock_t *lock)
+{
+ short prev;
+ short new;
+ int ret = 0;
+
+ asm volatile(
+ "movw %2,%w0\n\t"
+ "cmpb %h0,%b0\n\t"
+ "jne 1f\n\t"
+ "movw %w0,%w1\n\t"
+ "incb %h1\n\t"
+ "lock ; cmpxchgw %w1,%2\n\t"
+ "decb %h1\n\t"
+ "cmpw %w0,%w1\n\t"
+ "jne 1f\n\t"
+ "movl $1,%3\n\t"
+ "1:"
+ :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret)
+ :
+ : "memory", "cc");
+
+ return ret;
+}
+
+#if defined(CONFIG_X86_32) && \
+ (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
+/*
+ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
+ * (PPro errata 66, 92)
+ */
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
-# include "spinlock_64.h"
+# define UNLOCK_LOCK_PREFIX
#endif
+
+static inline void __raw_spin_unlock(raw_spinlock_t *lock)
+{
+ __asm__ __volatile__(
+ UNLOCK_LOCK_PREFIX "incb %0"
+ :"+m" (lock->slock)
+ :
+ :"memory", "cc");
+}
+
+static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
+{
+ while (__raw_spin_is_locked(lock))
+ cpu_relax();
+}
+
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ *
+ * On x86, we implement read-write locks as a 32-bit counter
+ * with the high bit (sign) being the "contended" bit.
+ */
+
+/**
+ * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
+static inline int __raw_read_can_lock(raw_rwlock_t *lock)
+{
+ return (int)(lock)->lock > 0;
+}
+
+/**
+ * write_can_lock - would write_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
+static inline int __raw_write_can_lock(raw_rwlock_t *lock)
+{
+ return (lock)->lock == RW_LOCK_BIAS;
+}
+
+static inline void __raw_read_lock(raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
+ "jns 1f\n"
+ "call __read_lock_failed\n\t"
+ "1:\n"
+ ::"r" (rw) : "memory");
+}
+
+static inline void __raw_write_lock(raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
+ "jz 1f\n"
+ "call __write_lock_failed\n\t"
+ "1:\n"
+ ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory");
+}
+
+static inline int __raw_read_trylock(raw_rwlock_t *lock)
+{
+ atomic_t *count = (atomic_t *)lock;
+ atomic_dec(count);
+ if (atomic_read(count) >= 0)
+ return 1;
+ atomic_inc(count);
+ return 0;
+}
+
+static inline int __raw_write_trylock(raw_rwlock_t *lock)
+{
+ atomic_t *count = (atomic_t *)lock;
+ if (atomic_sub_and_test(RW_LOCK_BIAS, count))
+ return 1;
+ atomic_add(RW_LOCK_BIAS, count);
+ return 0;
+}
+
+static inline void __raw_read_unlock(raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory");
+}
+
+static inline void __raw_write_unlock(raw_rwlock_t *rw)
+{
+ asm volatile(LOCK_PREFIX "addl %1,%0"
+ : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
+}
+
+#define _raw_spin_relax(lock) cpu_relax()
+#define _raw_read_relax(lock) cpu_relax()
+#define _raw_write_relax(lock) cpu_relax()
+
+#endif /* __ASM_SPINLOCK_H */
Index: linux-2.6/include/asm-x86/spinlock_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ /dev/null
@@ -1,211 +0,0 @@
-#ifndef __ASM_SPINLOCK_H
-#define __ASM_SPINLOCK_H
-
-#include <asm/atomic.h>
-#include <asm/rwlock.h>
-#include <asm/page.h>
-#include <asm/processor.h>
-#include <linux/compiler.h>
-
-/*
- * Your basic SMP spinlocks, allowing only a single CPU anywhere
- *
- * Simple spin lock operations. There are two variants, one clears IRQ's
- * on the local processor, one does not.
- *
- * These are fair FIFO ticket locks, which are currently limited to 256
- * CPUs.
- *
- * (the type definitions are in asm/spinlock_types.h)
- */
-
-#if (NR_CPUS > 256)
-#error spinlock supports a maximum of 256 CPUs
-#endif
-
-static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
-{
- int tmp = *(volatile signed int *)(&(lock)->slock);
-
- return (((tmp >> 8) & 0xff) != (tmp & 0xff));
-}
-
-static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
-{
- int tmp = *(volatile signed int *)(&(lock)->slock);
-
- return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
-}
-
-static inline void __raw_spin_lock(raw_spinlock_t *lock)
-{
- short inc = 0x0100;
-
- /*
- * Ticket locks are conceptually two bytes, one indicating the current
- * head of the queue, and the other indicating the current tail. The
- * lock is acquired by atomically noting the tail and incrementing it
- * by one (thus adding ourself to the queue and noting our position),
- * then waiting until the head becomes equal to the initial value
- * of the tail.
- *
- * This uses a 16-bit xadd to increment the tail and also load the
- * position of the head, which takes care of memory ordering issues
- * and should be optimal for the uncontended case. Note the tail must
- * be in the high byte, otherwise the 16-bit wide increment of the low
- * byte would carry up and contaminate the high byte.
- */
-
- __asm__ __volatile__ (
- LOCK_PREFIX "xaddw %w0, %1\n"
- "1:\t"
- "cmpb %h0, %b0\n\t"
- "je 2f\n\t"
- "rep ; nop\n\t"
- "movb %1, %b0\n\t"
- /* don't need lfence here, because loads are in-order */
- "jmp 1b\n"
- "2:"
- :"+Q" (inc), "+m" (lock->slock)
- :
- :"memory", "cc");
-}
-
-#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
-
-static inline int __raw_spin_trylock(raw_spinlock_t *lock)
-{
- short prev;
- short new;
- int ret = 0;
-
- asm volatile(
- "movw %2,%w0\n\t"
- "cmpb %h0,%b0\n\t"
- "jne 1f\n\t"
- "movw %w0,%w1\n\t"
- "incb %h1\n\t"
- "lock ; cmpxchgw %w1,%2\n\t"
- "decb %h1\n\t"
- "cmpw %w0,%w1\n\t"
- "jne 1f\n\t"
- "movl $1,%3\n\t"
- "1:"
- :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret)
- :
- : "memory", "cc");
-
- return ret;
-}
-
-#if defined(CONFIG_X86_32) && \
- (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
-/*
- * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
- * (PPro errata 66, 92)
- */
-# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
-#else
-# define UNLOCK_LOCK_PREFIX
-#endif
-
-static inline void __raw_spin_unlock(raw_spinlock_t *lock)
-{
- __asm__ __volatile__(
- UNLOCK_LOCK_PREFIX "incb %0"
- :"+m" (lock->slock)
- :
- :"memory", "cc");
-}
-
-static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
-{
- while (__raw_spin_is_locked(lock))
- cpu_relax();
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- *
- * On x86, we implement read-write locks as a 32-bit counter
- * with the high bit (sign) being the "contended" bit.
- */
-
-/**
- * read_can_lock - would read_trylock() succeed?
- * @lock: the rwlock in question.
- */
-static inline int __raw_read_can_lock(raw_rwlock_t *lock)
-{
- return (int)(lock)->lock > 0;
-}
-
-/**
- * write_can_lock - would write_trylock() succeed?
- * @lock: the rwlock in question.
- */
-static inline int __raw_write_can_lock(raw_rwlock_t *lock)
-{
- return (lock)->lock == RW_LOCK_BIAS;
-}
-
-static inline void __raw_read_lock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
- "jns 1f\n"
- "call __read_lock_failed\n\t"
- "1:\n"
- ::"r" (rw) : "memory");
-}
-
-static inline void __raw_write_lock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
- "jz 1f\n"
- "call __write_lock_failed\n\t"
- "1:\n"
- ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory");
-}
-
-static inline int __raw_read_trylock(raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- atomic_dec(count);
- if (atomic_read(count) >= 0)
- return 1;
- atomic_inc(count);
- return 0;
-}
-
-static inline int __raw_write_trylock(raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- if (atomic_sub_and_test(RW_LOCK_BIAS, count))
- return 1;
- atomic_add(RW_LOCK_BIAS, count);
- return 0;
-}
-
-static inline void __raw_read_unlock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory");
-}
-
-static inline void __raw_write_unlock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "addl %1,%0"
- : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
-}
-
-#define _raw_spin_relax(lock) cpu_relax()
-#define _raw_read_relax(lock) cpu_relax()
-#define _raw_write_relax(lock) cpu_relax()
-
-#endif /* __ASM_SPINLOCK_H */
Index: linux-2.6/include/asm-x86/spinlock_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_64.h
+++ /dev/null
@@ -1,200 +0,0 @@
-#ifndef __ASM_SPINLOCK_H
-#define __ASM_SPINLOCK_H
-
-#include <asm/atomic.h>
-#include <asm/rwlock.h>
-#include <asm/page.h>
-#include <asm/processor.h>
-#include <linux/compiler.h>
-
-/*
- * Your basic SMP spinlocks, allowing only a single CPU anywhere
- *
- * Simple spin lock operations. There are two variants, one clears IRQ's
- * on the local processor, one does not.
- *
- * These are fair FIFO ticket locks, which are currently limited to 256
- * CPUs.
- *
- * (the type definitions are in asm/spinlock_types.h)
- */
-
-#if (NR_CPUS > 256)
-#error spinlock supports a maximum of 256 CPUs
-#endif
-
-static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
-{
- int tmp = *(volatile signed int *)(&(lock)->slock);
-
- return (((tmp >> 8) & 0xff) != (tmp & 0xff));
-}
-
-static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
-{
- int tmp = *(volatile signed int *)(&(lock)->slock);
-
- return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
-}
-
-static inline void __raw_spin_lock(raw_spinlock_t *lock)
-{
- short inc = 0x0100;
-
- /*
- * Ticket locks are conceptually two bytes, one indicating the current
- * head of the queue, and the other indicating the current tail. The
- * lock is acquired by atomically noting the tail and incrementing it
- * by one (thus adding ourself to the queue and noting our position),
- * then waiting until the head becomes equal to the initial value
- * of the tail.
- *
- * This uses a 16-bit xadd to increment the tail and also load the
- * position of the head, which takes care of memory ordering issues
- * and should be optimal for the uncontended case. Note the tail must
- * be in the high byte, otherwise the 16-bit wide increment of the low
- * byte would carry up and contaminate the high byte.
- */
-
- __asm__ __volatile__ (
- LOCK_PREFIX "xaddw %w0, %1\n"
- "1:\t"
- "cmpb %h0, %b0\n\t"
- "je 2f\n\t"
- "rep ; nop\n\t"
- "movb %1, %b0\n\t"
- /* don't need lfence here, because loads are in-order */
- "jmp 1b\n"
- "2:"
- :"+Q" (inc), "+m" (lock->slock)
- :
- :"memory", "cc");
-}
-
-#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
-
-static inline int __raw_spin_trylock(raw_spinlock_t *lock)
-{
- short prev;
- short new;
- int ret = 0;
-
- asm volatile(
- "movw %2,%w0\n\t"
- "cmpb %h0,%b0\n\t"
- "jne 1f\n\t"
- "movw %w0,%w1\n\t"
- "incb %h1\n\t"
- "lock ; cmpxchgw %w1,%2\n\t"
- "decb %h1\n\t"
- "cmpw %w0,%w1\n\t"
- "jne 1f\n\t"
- "movl $1,%3\n\t"
- "1:"
- :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret)
- :
- : "memory", "cc");
-
- return ret;
-}
-
-static inline void __raw_spin_unlock(raw_spinlock_t *lock)
-{
- __asm__ __volatile__(
- "incb %0"
- :"+m" (lock->slock)
- :
- :"memory", "cc");
-}
-
-static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
-{
- while (__raw_spin_is_locked(lock))
- cpu_relax();
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- *
- * On x86, we implement read-write locks as a 32-bit counter
- * with the high bit (sign) being the "contended" bit.
- */
-
-/**
- * read_can_lock - would read_trylock() succeed?
- * @lock: the rwlock in question.
- */
-static inline int __raw_read_can_lock(raw_rwlock_t *lock)
-{
- return (int)(lock)->lock > 0;
-}
-
-/**
- * write_can_lock - would write_trylock() succeed?
- * @lock: the rwlock in question.
- */
-static inline int __raw_write_can_lock(raw_rwlock_t *lock)
-{
- return (lock)->lock == RW_LOCK_BIAS;
-}
-
-static inline void __raw_read_lock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
- "jns 1f\n"
- "call __read_lock_failed\n\t"
- "1:\n"
- ::"r" (rw) : "memory");
-}
-
-static inline void __raw_write_lock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
- "jz 1f\n"
- "call __write_lock_failed\n\t"
- "1:\n"
- ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory");
-}
-
-static inline int __raw_read_trylock(raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- atomic_dec(count);
- if (atomic_read(count) >= 0)
- return 1;
- atomic_inc(count);
- return 0;
-}
-
-static inline int __raw_write_trylock(raw_rwlock_t *lock)
-{
- atomic_t *count = (atomic_t *)lock;
- if (atomic_sub_and_test(RW_LOCK_BIAS, count))
- return 1;
- atomic_add(RW_LOCK_BIAS, count);
- return 0;
-}
-
-static inline void __raw_read_unlock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory");
-}
-
-static inline void __raw_write_unlock(raw_rwlock_t *rw)
-{
- asm volatile(LOCK_PREFIX "addl %1,%0"
- : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
-}
-
-#define _raw_spin_relax(lock) cpu_relax()
-#define _raw_read_relax(lock) cpu_relax()
-#define _raw_write_relax(lock) cpu_relax()
-
-#endif /* __ASM_SPINLOCK_H */

2007-11-01 14:06:20

by Peter Zijlstra

Subject: Re: [patch 1/4] spinlock: lockbreak cleanup

On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote:

> Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
> decouple it from the spinlock implementation, and make it typesafe (rwlocks
> do not have any need_lockbreak sites -- why do they even get bloated up
> with that break_lock then?).

IIRC Lee has a few patches floating about that do introduce lockbreak
stuff for rwlocks.


2007-11-01 14:29:42

by Nick Piggin

Subject: Re: [patch 1/4] spinlock: lockbreak cleanup

On Thu, Nov 01, 2007 at 03:06:05PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote:
>
> > Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
> > decouple it from the spinlock implementation, and make it typesafe (rwlocks
> > do not have any need_lockbreak sites -- why do they even get bloated up
> > with that break_lock then?).
>
> IIRC Lee has a few patches floating about that do introduce lockbreak
> stuff for rwlocks.

Well, that would be a good reason to introduce a break_lock for them,
but previously not so much... we have rwlocks in some slightly
space-critical structures (vmas, inodes, etc).

I guess it was done to make the "template" hacks easier. I don't really
find that in good taste, especially for important core infrastructure.
Anyway.

2007-11-01 14:41:15

by Gregory Haskins

Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

Nick Piggin wrote:
> Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
> is described in the comments. The straight-line lock/unlock instruction
> sequence is slightly slower than the dec based locks on modern x86 CPUs,
> however the difference is quite small on Core2 and Opteron when working out of
> cache, and becomes almost insignificant even on P4 when the lock misses cache.
> trylock is more significantly slower, but they are relatively rare.
>
> On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable,
> with a userspace test having a difference of up to 2x runtime per thread, and
> some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
> times. After this patch, all threads appear to finish at exactly the same
> time.

I had observed this phenomenon on some 8-ways here as well, but I didn't
have the bandwidth to code something up. Thumbs up!

Regards,
-Greg

2007-11-01 15:40:01

by Lee Schermerhorn

Subject: Re: [patch 1/4] spinlock: lockbreak cleanup

On Thu, 2007-11-01 at 15:29 +0100, Nick Piggin wrote:
> On Thu, Nov 01, 2007 at 03:06:05PM +0100, Peter Zijlstra wrote:
> > On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote:
> >
> > > Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
> > > decouple it from the spinlock implementation, and make it typesafe (rwlocks
> > > do not have any need_lockbreak sites -- why do they even get bloated up
> > > with that break_lock then?).
> >
> > IIRC Lee has a few patches floating about that do introduce lockbreak
> > stuff for rwlocks.
>
> Well that would be a good reason to introduce a break_lock for them,
> but previously not so much... we have rwlocks in some slightly space
> critical structures (vmas, inodes, etc).
>
> I guess it was done to make the "template" hacks easier. I don't really
> find that in good taste, especially for important core infrastructure.
> Anyway.

Actually, what I had/have is a cond_resched_rwlock() that I needed to
convert the i_mmap_lock() to rw for testing reclaim scalability. [I've
seen a large system running an Oracle OLTP load hang spitting "cpu soft
lockup" messages with all cpus spinning on an i_mmap_lock spin lock.]
One of the i_mmap_lock paths uses cond_resched_lock() for spin locks.
To do a straightforward conversion [and maybe that isn't the right
approach], I created the cond_resched_rwlock() function by generalizing
the cond_resched_lock() code and creating both spin and rw lock wrappers.
I took advantage of the fact that, currently, need_lockbreak() is a
macro and that both spin and rw locks have/had the break_lock member.
Typesafe functions would probably be preferable, if we want to keep
break_lock for rw spin locks.
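
Roughly, the rw wrapper has the following shape (hand-waved sketch that skips
the lockdep/preempt details the real cond_resched_lock() has; the actual code
is in the posting below):

static int cond_resched_rwlock(rwlock_t *lock, int is_write)
{
	int ret = 0;

	/* relies on need_lockbreak() being a macro and rwlock_t having break_lock */
	if (need_lockbreak(lock) || need_resched()) {
		if (is_write)
			write_unlock(lock);
		else
			read_unlock(lock);
		cond_resched();
		ret = 1;
		if (is_write)
			write_lock(lock);
		else
			read_lock(lock);
	}
	return ret;
}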

Here's the most recent posting:

http://marc.info/?l=linux-mm&m=118980356306014&w=4

See the changes to sched.[ch]. Should apply to 23-mm1 with offsets and
minor fixup in fs/inode.c.

Lee



2007-11-01 15:47:33

by Ingo Molnar

Subject: Re: [patch 1/4] spinlock: lockbreak cleanup


* Lee Schermerhorn <[email protected]> wrote:

> > I guess it was done to make the "template" hacks easier. I don't
> > really find that in good taste, especially for important core
> > infrastructure. Anyway.
>
> Actually, what I had/have is a cond_resched_rwlock() that I needed to
> convert the i_mmap_lock() to rw for testing reclaim scalability.
> [I've seen a large system running an Oracle OLTP load hang spitting
> "cpu soft lockup" messages with all cpus spinning on a i_mmap_lock
> spin lock.] One of the i_mmap_lock paths uses cond_resched_lock() for
> spin locks. To do a straight forward conversion [and maybe that isn't
> the right approach], I created the cond_resched_rwlock() function by
> generalizing the cond_sched_lock() code and creating both spin and rw
> lock wrappers. I took advantage of the fact that, currently,
> need_lockbreak() is a macro and that both spin and rw locks have/had
> the break_lock member. Typesafe functions would probably be
> preferable, if we want to keep break_lock for rw spin locks.
>
> Here's the most recent posting:
>
> http://marc.info/?l=linux-mm&m=118980356306014&w=4
>
> See the changes to sched.[ch]. Should apply to 23-mm1 with offsets
> and minor fixup in fs/inode.c.

yep. I too am in favor of keeping the need-lockbreak mechanism and its
type-insensitive data structure. We've got way too many locking
primitives and keeping them all sorted is nontrivial already. I wouldn't
mind seeing the need_lockbreak flag move into one of the high bits of
spinlocks though, to compress size.

Ingo

2007-11-01 15:53:59

by Nick Piggin

Subject: Re: [patch 1/4] spinlock: lockbreak cleanup

On Thu, Nov 01, 2007 at 04:46:36PM +0100, Ingo Molnar wrote:
>
> * Lee Schermerhorn <[email protected]> wrote:
>
> > > I guess it was done to make the "template" hacks easier. I don't
> > > really find that in good taste, especially for important core
> > > infrastructure. Anyway.
> >
> > Actually, what I had/have is a cond_resched_rwlock() that I needed to
> > convert the i_mmap_lock() to rw for testing reclaim scalability.
> > [I've seen a large system running an Oracle OLTP load hang spitting
> > "cpu soft lockup" messages with all cpus spinning on a i_mmap_lock
> > spin lock.] One of the i_mmap_lock paths uses cond_resched_lock() for
> > spin locks. To do a straight forward conversion [and maybe that isn't
> > the right approach], I created the cond_resched_rwlock() function by
> > generalizing the cond_sched_lock() code and creating both spin and rw
> > lock wrappers. I took advantage of the fact that, currently,
> > need_lockbreak() is a macro and that both spin and rw locks have/had
> > the break_lock member. Typesafe functions would probably be
> > preferable, if we want to keep break_lock for rw spin locks.
> >
> > Here's the most recent posting:
> >
> > http://marc.info/?l=linux-mm&m=118980356306014&w=4
> >
> > See the changes to sched.[ch]. Should apply to 23-mm1 with offsets
> > and minor fixup in fs/inode.c.
>
> yep. I too am in favor of keeping the need-lockbreak mechanism and its
> type-insensitive data structure. We've got way too many locking
> primitives and keeping them all sorted is nontrivial already.

I think a large contributor to that is being a bit clever with indirections
and cute code (eg. like this template stuff), rather than having two types of
spinlocks instead of one.


2007-11-01 16:38:50

by Linus Torvalds

Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks



On Thu, 1 Nov 2007, Gregory Haskins wrote:
>
> I had observed this phenomenon on some 8-ways here as well, but I didn't
> have the bandwidth to code something up. Thumbs up!

Can you test under interesting loads?

We're interested in:
- is the unfairness fix really noticeable (or does it just move the
problem somewhere else, and there is no real change in behaviour)
- what is the performance impact?

In particular, unfair spinlocks have the potential to perform much better.
Not so much because the spinlock itself acts all that differently, but
because being unfair also fundamentally tends to keep the data structures
that are *protected* by the spinlock on just one CPU.

So "unfair" is obviously always bad. Except when it isn't. I'd personally
like to merge the ticket spinlocks, but I'd really like to have people who
have real loads where they matter actually also do some performance
testing. Because I do think it will potentially be a performance downer.

(I obviously hope it won't be, but..)

Linus

2007-11-01 20:02:17

by Chuck Ebbert

Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On 11/01/2007 10:03 AM, Nick Piggin wrote:

[edited to show the resulting code]

> + __asm__ __volatile__ (
> + LOCK_PREFIX "xaddw %w0, %1\n"
> + "1:\t"
> + "cmpb %h0, %b0\n\t"
> + "je 2f\n\t"
> + "rep ; nop\n\t"
> + "movb %1, %b0\n\t"
> + /* don't need lfence here, because loads are in-order */
> "jmp 1b\n"
> + "2:"
> + :"+Q" (inc), "+m" (lock->slock)
> + :
> + :"memory", "cc");
> }

If you really thought you might get long queues, you could figure out
how far back you are and use that to determine how long to wait before
testing the lock again. That cmpb could become a subb without adding
overhead to the fast path -- that would give you the queue length (or
its complement anyway.)

2007-11-02 00:01:07

by Nick Piggin

Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote:
> On 11/01/2007 10:03 AM, Nick Piggin wrote:
>
> [edited to show the resulting code]
>
> > + __asm__ __volatile__ (
> > + LOCK_PREFIX "xaddw %w0, %1\n"
> > + "1:\t"
> > + "cmpb %h0, %b0\n\t"
> > + "je 2f\n\t"
> > + "rep ; nop\n\t"
> > + "movb %1, %b0\n\t"
> > + /* don't need lfence here, because loads are in-order */
> > "jmp 1b\n"
> > + "2:"
> > + :"+Q" (inc), "+m" (lock->slock)
> > + :
> > + :"memory", "cc");
> > }
>
> If you really thought you might get long queues, you could figure out
> how far back you are and use that to determine how long to wait before
> testing the lock again. That cmpb could become a subb without adding
> overhead to the fast path -- that would give you the queue length (or
> its complement anyway.)

Indeed. You can use this as a really nice input into a backoff
algorithm (eg. if you're next in line, don't back off, or at least
don't go into exponential backoff; if you've got people in front
of you, start throttling harder).
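
Something with the following shape, say (purely a sketch, nothing like this is
in the patch: the low byte of ->slock is the head on x86, my_ticket is the
ticket we took with the xadd, and the delay constant is plucked from thin air):

static void ticket_spin_backoff(raw_spinlock_t *lock, unsigned char my_ticket)
{
	for (;;) {
		unsigned char head = *(volatile unsigned char *)&lock->slock;
		unsigned char ahead = my_ticket - head;	/* queue distance */

		if (!ahead)
			break;			/* our turn */
		if (ahead == 1)
			cpu_relax();		/* next in line: plain spin */
		else
			ndelay(ahead * 50);	/* further back: back off in proportion */
	}
}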

I think I'll leave that to SGI if they come up with a big x86 SSI ;)

2007-11-02 00:35:52

by Rik van Riel

Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Thu, 1 Nov 2007 09:38:22 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> So "unfair" is obviously always bad. Except when it isn't.

Larry Woodman managed to wedge the VM into a state where, on his
4x dual core system, only 2 cores (on the same CPU) could get the
zone->lru_lock overnight. The other 6 cores on the system were
just spinning, without being able to get the lock.

On the other hand, spinlock contention in the page replacement
code is just a symptom of the fact that we scan too many pages.
It can probably be fixed in other ways...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-11-02 01:20:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks



On Thu, 1 Nov 2007, Rik van Riel wrote:
>
> Larry Woodman managed to wedge the VM into a state where, on his
> 4x dual core system, only 2 cores (on the same CPU) could get the
> zone->lru_lock overnight. The other 6 cores on the system were
> just spinning, without being able to get the lock.

.. and this is almost always the result of a locking *bug*, not unfairness
per se. IOW, unfairness just ends up showing the bug in the first place.

Linus

2007-11-02 02:01:40

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Thu, 1 Nov 2007 18:19:41 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> >
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight. The other 6 cores on the system were
> > just spinning, without being able to get the lock.
>
> .. and this is almost always the result of a locking *bug*, not
> unfairness per se. IOW, unfairness just ends up showing the bug in
> the first place.

No argument there. If you have the kind of lock contention where
fairness matters, the contention is probably what needs to be fixed,
not the locking mechanism.

Having said that, making bugs like that less likely to totally wedge
a system would be a good thing for everybody who uses Linux in
production. Exposing bugs is good for development, bad for business.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-11-02 06:42:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> >
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight. The other 6 cores on the system were
> > just spinning, without being able to get the lock.

That's quite incredible, considering that the CPUs actually _taking_
the locks also drop the locks and do quite a bit of work before taking
them again (ie. they take them to pull pages off the LRU, but then
do a reasonable amount of work to remove each one from pagecache before
refilling from the LRU).

Possibly actually that is a *more* difficult case for the HW to handle:
once the CPU actually goes away and operates on other cachelines, it
may get a little more difficult to detect that it is causing starvation
issues.


> .. and this is almost always the result of a locking *bug*, not unfairness
> per se. IOW, unfairness just ends up showing the bug in the first place.

I'd almost agree, but there are always going to be corner cases where
we get multiple contentions on a spinlock -- the fact that a lock is
needed at all obviously suggests that it can be contended. The LRU locking
could be improved, but you could have eg. scheduler runqueue lock starvation
if the planets lined up just right, and it is a little more difficult to
improve on runqueue locking.

Anyway, I also think this is partially a hardware issue, and as multiple
cores, threads, and sockets get more common, I hope it will improve (it
affects Intel CPUs as well as AMD). So it is possible to have an option
to switch between locks if the hardware is fairer, but I want to get
as much exposure with this locking as possible for now, to see if there
are any funny performance corner cases exposed (which quite possibly will
turn out to be caused by suboptimal locking itself).

Anyway, if this can make its way to the x86 tree, I think it will get
pulled into -mm (?) and get some exposure...

2007-11-02 14:06:07

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Fri, 2 Nov 2007 07:42:20 +0100
Nick Piggin <[email protected]> wrote:

> On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> >
> >
> > On Thu, 1 Nov 2007, Rik van Riel wrote:
> > >
> > > Larry Woodman managed to wedge the VM into a state where, on his
> > > 4x dual core system, only 2 cores (on the same CPU) could get the
> > > zone->lru_lock overnight. The other 6 cores on the system were
> > > just spinning, without being able to get the lock.
>
> That's quite incredible, considering that the CPUs actually _taking_
> the locks also drop the locks and do quite a bit of work before taking
> them again (ie. they take them to pull pages off the LRU, but then
> do a reasonable amount of work to remove each one from pagecache
> before refilling from the LRU).
>
> Possibly actually that is a *more* difficult case for the HW to
> handle: once the CPU actually goes away and operates on other
> cachelines, it may get a little more difficult to detect that it is
> causing starvation issues.

In case of the zone->lru_lock, grabbing the spinlock does
not mean that the process is letting go of the cacheline.

On the contrary, once the spinlock has been grabbed, the
real cacheline prodding begins.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-11-02 14:24:58

by Gregory Haskins

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

Linus Torvalds wrote:
>
> On Thu, 1 Nov 2007, Gregory Haskins wrote:
>> I had observed this phenomenon on some 8-ways here as well, but I didn't
>> have the bandwidth to code something up. Thumbs up!
>
> Can you test under interesting loads?

Sure thing. I'll try this next week.

>
> We're interested in:
> - is the unfairness fix really noticeable (or does it just move the
> problem somewhere else, and there is no real change in behaviour)
> - what is the performance impact?
>
> In particular, unfair spinlocks have the potential to perform much better.

I see where you are going here, and I mostly agree. I think the key is
that "given equal contention, let the guy with the hottest cache win".
The problem with the current implementation is that the spinlocks have
no way to gauge the details of the contention. They can only gauge
instantaneous snapshots of state as viewed by each TSL invocation, which
effectively resets your position each time.

On the flip side, Nick's patches take the opposite extreme. If a lock
is contended, get in line. ;) This has the desirable property of
avoiding starvation. However, it will also tend to cause more bouncing
since you are virtually guaranteed not to re-win the contended lock, as
you point out next.

> Not so much because the spinlock itself acts all that differently, but
> because being unfair also fundamentally tends to keep the data structures
> that are *protected* by the spinlock on just one CPU.

My issue here is that this behavior can also be described as precisely
part of the problem being addressed: that is, both CPUs presumably
*want/need* access to the data or they wouldn't be taking the spinlock
to begin with. So it's really not a question of keeping the structures
on one CPU per se (at least, not for unbounded durations, or the system
won't operate properly).

Rather, I think the key is to minimize the impact by bouncing things
intelligently. ;) I.e., if all things are equal, favor the hottest task
so the data only bounces once instead of twice. Outside of this
condition, operate strict FIFO. If we can reasonably calculate when
this optimization is possible, we will have the best of both worlds. I
have some ideas about ways to extend Nick's algorithm to support this,
which I will submit ASAP.

I think the rest of what you said is very fair: Prove that it's a
problem, this concept helps, and we don't make things worse ;)

Will do, ASAP.

Regards,
-Greg

2007-11-02 15:33:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks


* Nick Piggin <[email protected]> wrote:

> Anyway, if this can make its way to the x86 tree, I think it will get
> pulled into -mm (?) and get some exposure...

ok, we can certainly try it there. Your code is really nifty.

Ingo

2007-11-02 16:23:15

by Chuck Ebbert

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On 11/01/2007 10:03 AM, Nick Piggin wrote:
> Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
> is described in the comments. The straight-line lock/unlock instruction
> sequence is slightly slower than the dec based locks on modern x86 CPUs,
> however the difference is quite small on Core2 and Opteron when working out of
> cache, and becomes almost insignificant even on P4 when the lock misses cache.
> trylock is more significantly slower, but they are relatively rare.
>
> On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable,
> with a userspace test having a difference of up to 2x runtime per thread, and
> some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
> times. After this patch, all threads appear to finish at exactly the same
> time.

There's also a very easy way to get better fairness with our current spinlocks:
use xchg to release the lock instead of mov.
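
To make the comparison concrete, here is a sketch of the two release
sequences against the pre-ticket byte spinlock (1 = unlocked, 0 or
below = locked). The helper names are made up and this is not the
actual kernel code:

static inline void byte_spin_unlock_mov(volatile signed char *slock)
{
	/* the stock release: a plain store.  Cheap, but nothing forces the
	 * releasing core to give other sockets a fair shot at the line. */
	asm volatile("movb $1, %0" : "=m" (*slock) : : "memory");
}

static inline void byte_spin_unlock_xchg(volatile signed char *slock)
{
	signed char unlocked = 1;

	/* xchg with a memory operand carries an implicit lock prefix, so
	 * the release itself becomes a serializing atomic operation, which
	 * is exactly why it is slower, as the replies below point out. */
	asm volatile("xchgb %b0, %1"
		     : "+q" (unlocked), "+m" (*slock) : : "memory");
}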

2007-11-02 16:52:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks



On Fri, 2 Nov 2007, Chuck Ebbert wrote:
>
> There's also a very easy way to get better fairness with our current spinlocks:
> use xchg to release the lock instead of mov.

That does nothing at all.

Yes, it slows the unlock down, which in turn on some machines will make it
easier for another core/socket to get it, but it's purely about the
slowdown, nothing else.

Linus

2007-11-02 22:37:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Fri, Nov 02, 2007 at 10:05:37AM -0400, Rik van Riel wrote:
> On Fri, 2 Nov 2007 07:42:20 +0100
> Nick Piggin <[email protected]> wrote:
>
> > On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> > >
> > >
> > > On Thu, 1 Nov 2007, Rik van Riel wrote:
> > > >
> > > > Larry Woodman managed to wedge the VM into a state where, on his
> > > > 4x dual core system, only 2 cores (on the same CPU) could get the
> > > > zone->lru_lock overnight. The other 6 cores on the system were
> > > > just spinning, without being able to get the lock.
> >
> > That's quite incredible, considering that the CPUs actually _taking_
> > the locks also drop the locks and do quite a bit of work before taking
> > them again (ie. they take them to pull pages off the LRU, but then
> > do a reasonable amount of work to remove each one from pagecache
> > before refilling from the LRU).
> >
> > Possibly actually that is a *more* difficult case for the HW to
> > handle: once the CPU actually goes away and operates on other
> > cachelines, it may get a little more difficult to detect that it is
> > causing starvation issues.
>
> In case of the zone->lru_lock, grabbing the spinlock does
> not mean that the process is letting go of the cacheline.
>
> On the contrary, once the spinlock has been grabbed, the
> real cacheline prodding begins.

I didn't say that, though. Obviously the hardware can't do anything
about starvation until a lock is released.

2007-11-02 23:01:51

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Fri, Nov 02, 2007 at 09:51:27AM -0700, Linus Torvalds wrote:
>
>
> On Fri, 2 Nov 2007, Chuck Ebbert wrote:
> >
> > There's also a very easy way to get better fairness with our current spinlocks:
> > use xchg to release the lock instead of mov.
>
> That does nothing at all.
>
> Yes, it slows the unlock down, which in turn on some machines will make it
> easier for another core/socket to get it, but it's purely about the
> slowdown, nothing else.

Yeah, it's not such a good idea... it slows down the single-threaded case
like crazy. On my dual-core Core2:

_Single thread_
inc-lock in cache takes 21.94ns
xadd-lock in cache takes 22.64ns
xchg-lock in cache takes 35.21ns

inc-lock out of cache takes 140.73ns
xadd-lock out of cache takes 141.15ns
xchg-lock out of cache takes 155.13ns


In the contended multi-threaded tight loop, the xchg lock is slower than the
inc lock but still beats the fair xadd lock; that's only because it is
just as unfair, if not more so, on this hardware (runtime difference of up to
about 10%).
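
For context, numbers of this kind can be collected with a trivial
userspace harness along the following lines. This is a reconstruction,
not Nick's actual test program: it measures only the uncontended xadd
(ticket) case, reports raw TSC cycles rather than nanoseconds, and
makes no attempt to control for out-of-order execution around rdtsc:

#include <stdio.h>
#include <stdint.h>

static volatile uint16_t slock;	/* low byte: now serving, high byte: next ticket */

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline void xadd_lock(void)
{
	uint16_t inc = 0x0100;

	/* take a ticket; old lock word comes back in inc */
	asm volatile("lock; xaddw %w0, %1"
		     : "+Q" (inc), "+m" (slock) : : "memory", "cc");
	while ((uint8_t)inc != (uint8_t)(inc >> 8)) {
		asm volatile("rep; nop" ::: "memory");		/* pause */
		inc = (inc & 0xff00) | *(volatile uint8_t *)&slock;
	}
}

static inline void xadd_unlock(void)
{
	/* bump the now-serving byte */
	asm volatile("incb %0" : "+m" (slock) : : "memory", "cc");
}

int main(void)
{
	const int iters = 1000000;
	uint64_t start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < iters; i++) {
		xadd_lock();
		xadd_unlock();
	}
	end = rdtsc();

	printf("xadd lock+unlock: %.1f cycles/iteration\n",
	       (double)(end - start) / iters);
	return 0;
}

Built with gcc -O2 and pinned to one core with taskset, this gives a
rough per-iteration cost that is comparable in spirit to the in-cache
figures above.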

2007-11-03 00:57:14

by Chuck Ebbert

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On 11/02/2007 07:01 PM, Nick Piggin wrote:
>
> In the contended multi-threaded tight loop, the xchg lock is slower than the
> inc lock but still beats the fair xadd lock; that's only because it is
> just as unfair, if not more so, on this hardware (runtime difference of up to
> about 10%).
>

I meant xchg for unlock, not lock.

2007-11-03 03:41:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Fri, Nov 02, 2007 at 08:56:46PM -0400, Chuck Ebbert wrote:
> On 11/02/2007 07:01 PM, Nick Piggin wrote:
> >
> > In the contended multi-threaded tight loop, the xchg lock is slower than the
> > inc lock but still beats the fair xadd lock; that's only because it is
> > just as unfair, if not more so, on this hardware (runtime difference of up to
> > about 10%).
> >
>
> I meant xchg for unlock, not lock.

That is for unlock. 2x the number of atomic operations ~= 2x the cost.

2007-11-03 22:37:34

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 0/4] ticket spinlocks for x86

Nick Piggin wrote:
> Just for fun I also had a shot at merging the headers, as they become a
> lot more similar after this with the removal of the paravirt crud.

Glommer posted a set of patches the other day to implement x86-64
paravirt, which unifies lots of things including spinlocks. But if
you've removed the need to diddle with sti/cli, then that works too...

J

2007-11-07 08:46:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] x86: FIFO ticket spinlocks

On Fri, Nov 02, 2007 at 04:33:32PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > Anyway, if this can make its way to the x86 tree, I think it will get
> > pulled into -mm (?) and get some exposure...
>
> ok, we can certainly try it there.

Is there anything particular I have to do to get it into the x86 tree? And
presumably that tree is going to be picked up by Andrew at some
point? (-mm exposure is the main objective here; the only reason I
didn't send it to Andrew directly is in the interests of doing the
right thing.)


> Your code is really nifty.

Thanks. One of the CPU engineers at Intel asked to use it as their reference
ticket lock code sequence, which was pretty cool :)