2011-06-16 21:41:25

by Jeremy Fitzhardinge

Subject: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

From: Jeremy Fitzhardinge <[email protected]>

Hi all,

I'm proposing this series for 3[.0].1.

This is a repost of a series to clean up the x86 ticket lock
code by converting it to a mostly C implementation and removing
lots of duplicate code relating to the ticket size.

The last time I posted this series, the only significant comments
were from Nick Piggin, specifically relating to:

1. A wrongly placed barrier on unlock (which may have allowed the
compiler to move things out of the locked region). I went
belt-and-suspenders by having two barriers to prevent motion
into or out of the locked region.

2. With NR_CPUS < 256 the ticket size is 8 bits. The compiler doesn't
use the same trick as the hand-coded asm to directly compare the high
and low bytes in the word, but does a bit of extra shuffling around.
However, the Intel optimisation guide and several x86 experts have
opined that it's best to avoid the high-byte operations anyway, since
they will cause a partial word stall, and the gcc-generated code should
be better.

Overall the compiler-generated code is very similar to the hand-coded
versions, with the partial byte operations being the only significant
difference. (Curiously, gcc does generate a high-byte compare for me
in trylock, so it can if it wants to.)

I've been running with this code in place for several months on 4 core
systems without any problems.

I couldn't measure a consistent performance difference between the two
implementations; there seemed to be about +/- 1%, which is the level of
variation I see from simply recompiling the kernel with slightly
different code alignment.

Overall, I think the large reduction in code size is a big win.

Thanks,
J

Jeremy Fitzhardinge (7):
x86/ticketlock: clean up types and accessors
x86/ticketlock: convert spin loop to C
x86/ticketlock: Use C for __ticket_spin_unlock
x86/ticketlock: make large and small ticket versions of spin_lock the
same
x86/ticketlock: make __ticket_spin_lock common
x86/ticketlock: make __ticket_spin_trylock common
x86/ticketlock: prevent memory accesses from reordered out of lock
region

arch/x86/include/asm/spinlock.h | 147 ++++++++++++---------------------
arch/x86/include/asm/spinlock_types.h | 22 +++++-
2 files changed, 74 insertions(+), 95 deletions(-)

--
1.7.5.4


2011-06-16 21:41:22

by Jeremy Fitzhardinge

Subject: [PATCH 1/7] x86/ticketlock: clean up types and accessors

From: Jeremy Fitzhardinge <[email protected]>

A few cleanups to the way spinlocks are defined and accessed:
- define __ticket_t which is the size of a spinlock ticket (ie, enough
bits to hold all the cpus)
- Define struct arch_spinlock as a union containing plain slock and
the head and tail tickets
- Use head and tail to implement some of the spinlock predicates.
- Make all ticket variables unsigned.
- Use TICKET_SHIFT to form constants

Most of this will be used in later patches.
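
For illustration, the layout described above boils down to something like the
following standalone sketch (userspace C with gcc/C11 anonymous unions; the
names mirror the kernel's, but the sketch itself is not part of the patch):

#include <stdint.h>

typedef uint8_t __ticket_t;     /* NR_CPUS < 256 case; u16 otherwise */

#define TICKET_SHIFT    (sizeof(__ticket_t) * 8)
#define TICKET_MASK     ((__ticket_t)((1 << TICKET_SHIFT) - 1))

typedef struct arch_spinlock {
    union {
        unsigned int slock;             /* whole-word view */
        struct __raw_tickets {
            __ticket_t head, tail;      /* next to serve / next to hand out */
        } tickets;
    };
} arch_spinlock_t;

/* the lock is held while the serving ticket (head) lags the tail */
static inline int ticket_is_locked(arch_spinlock_t *lock)
{
    return lock->tickets.tail != lock->tickets.head;
}

/* more than one ticket outstanding means someone is waiting */
static inline int ticket_is_contended(arch_spinlock_t *lock)
{
    return ((lock->tickets.tail - lock->tickets.head) & TICKET_MASK) > 1;
}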

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 24 ++++++++++--------------
arch/x86/include/asm/spinlock_types.h | 20 ++++++++++++++++++--
2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3089f70..d6d5784 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -56,11 +56,9 @@
* much between them in performance though, especially as locks are out of line.
*/
#if (NR_CPUS < 256)
-#define TICKET_SHIFT 8
-
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- short inc = 0x0100;
+ unsigned short inc = 1 << TICKET_SHIFT;

asm volatile (
LOCK_PREFIX "xaddw %w0, %1\n"
@@ -79,7 +77,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
- int tmp, new;
+ unsigned int tmp, new;

asm volatile("movzwl %2, %0\n\t"
"cmpb %h0,%b0\n\t"
@@ -104,12 +102,10 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
: "memory", "cc");
}
#else
-#define TICKET_SHIFT 16
-
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- int inc = 0x00010000;
- int tmp;
+ unsigned inc = 1 << TICKET_SHIFT;
+ unsigned tmp;

asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
"movzwl %w0, %2\n\t"
@@ -129,8 +125,8 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
- int tmp;
- int new;
+ unsigned tmp;
+ unsigned new;

asm volatile("movl %2,%0\n\t"
"movl %0,%1\n\t"
@@ -160,16 +156,16 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)

static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
{
- int tmp = ACCESS_ONCE(lock->slock);
+ struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);

- return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+ return !!(tmp.tail ^ tmp.head);
}

static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
{
- int tmp = ACCESS_ONCE(lock->slock);
+ struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);

- return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+ return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
}

#ifndef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index dcb48b2..e3ad1e3 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -5,11 +5,27 @@
# error "please don't include this file directly"
#endif

+#include <linux/types.h>
+
+#if (CONFIG_NR_CPUS < 256)
+typedef u8 __ticket_t;
+#else
+typedef u16 __ticket_t;
+#endif
+
+#define TICKET_SHIFT (sizeof(__ticket_t) * 8)
+#define TICKET_MASK ((__ticket_t)((1 << TICKET_SHIFT) - 1))
+
typedef struct arch_spinlock {
- unsigned int slock;
+ union {
+ unsigned int slock;
+ struct __raw_tickets {
+ __ticket_t head, tail;
+ } tickets;
+ };
} arch_spinlock_t;

-#define __ARCH_SPIN_LOCK_UNLOCKED { 0 }
+#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } }

typedef struct {
unsigned int lock;
--
1.7.5.4

2011-06-16 21:41:27

by Jeremy Fitzhardinge

Subject: [PATCH 2/7] x86/ticketlock: convert spin loop to C

From: Jeremy Fitzhardinge <[email protected]>

The inner loop of __ticket_spin_lock isn't doing anything very special,
so reimplement it in C.

For the 8 bit ticket lock variant, we use a register union to get direct
access to the lower and upper bytes in the tickets, but unfortunately gcc
won't generate a direct comparison between the two halves of the register,
so the generated asm isn't quite as pretty as the hand-coded version.
However, benchmarking shows that this is actually a small improvement in
runtime performance on some benchmarks, and never a slowdown.

We also need to make sure there's a barrier at the end of the lock loop
to make sure that the compiler doesn't move any instructions from within
the locked region into the region where we don't yet own the lock.
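
As a rough userspace approximation of the resulting shape (x86, little-endian,
8-bit tickets; __sync_fetch_and_add stands in for the kernel's locked xadd,
and inline asm for cpu_relax()/barrier(); the ticket_lock name is illustrative,
not the patch's):

#include <stdint.h>

union ticketlock {
    uint16_t slock;                     /* head in the low byte, tail in the high byte */
    struct { uint8_t head, tail; } tickets;
};

static inline void ticket_lock(union ticketlock *lock)
{
    union ticketlock inc;

    /* atomically bump the tail; the old word that comes back is our ticket */
    inc.slock = __sync_fetch_and_add(&lock->slock, 1 << 8);

    while (inc.tickets.head != inc.tickets.tail) {
        __asm__ __volatile__("rep; nop");       /* cpu_relax() */
        inc.tickets.head = *(volatile uint8_t *)&lock->tickets.head;
    }
    __asm__ __volatile__("" ::: "memory");      /* barrier(): keep the critical section below */
}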

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 58 +++++++++++++++++++-------------------
1 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index d6d5784..f48a6e3 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -58,21 +58,21 @@
#if (NR_CPUS < 256)
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- unsigned short inc = 1 << TICKET_SHIFT;
-
- asm volatile (
- LOCK_PREFIX "xaddw %w0, %1\n"
- "1:\t"
- "cmpb %h0, %b0\n\t"
- "je 2f\n\t"
- "rep ; nop\n\t"
- "movb %1, %b0\n\t"
- /* don't need lfence here, because loads are in-order */
- "jmp 1b\n"
- "2:"
- : "+Q" (inc), "+m" (lock->slock)
- :
- : "memory", "cc");
+ register union {
+ struct __raw_tickets tickets;
+ unsigned short slock;
+ } inc = { .slock = 1 << TICKET_SHIFT };
+
+ asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
+ : "+Q" (inc), "+m" (lock->slock) : : "memory", "cc");
+
+ for (;;) {
+ if (inc.tickets.head == inc.tickets.tail)
+ goto out;
+ cpu_relax();
+ inc.tickets.head = ACCESS_ONCE(lock->tickets.head);
+ }
+out: barrier(); /* make sure nothing creeps before the lock is taken */
}

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
@@ -105,22 +105,22 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
unsigned inc = 1 << TICKET_SHIFT;
- unsigned tmp;
+ __ticket_t tmp;

- asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
- "movzwl %w0, %2\n\t"
- "shrl $16, %0\n\t"
- "1:\t"
- "cmpl %0, %2\n\t"
- "je 2f\n\t"
- "rep ; nop\n\t"
- "movzwl %1, %2\n\t"
- /* don't need lfence here, because loads are in-order */
- "jmp 1b\n"
- "2:"
- : "+r" (inc), "+m" (lock->slock), "=&r" (tmp)
- :
- : "memory", "cc");
+ asm volatile(LOCK_PREFIX "xaddl %0, %1\n\t"
+ : "+r" (inc), "+m" (lock->slock)
+ : : "memory", "cc");
+
+ tmp = inc;
+ inc >>= TICKET_SHIFT;
+
+ for (;;) {
+ if ((__ticket_t)inc == tmp)
+ goto out;
+ cpu_relax();
+ tmp = ACCESS_ONCE(lock->tickets.head);
+ }
+out: barrier(); /* make sure nothing creeps before the lock is taken */
}

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
--
1.7.5.4

2011-06-16 21:41:15

by Jeremy Fitzhardinge

Subject: [PATCH 3/7] x86/ticketlock: Use C for __ticket_spin_unlock

From: Jeremy Fitzhardinge <[email protected]>

If we don't need to use a locked inc for unlock, then implement it in C.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 32 +++++++++++++++++---------------
1 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index f48a6e3..0170ba9 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -33,9 +33,21 @@
* On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
* (PPro errata 66, 92)
*/
-# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
+static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
+{
+ if (sizeof(lock->tickets.head) == sizeof(u8))
+ asm (LOCK_PREFIX "incb %0"
+ : "+m" (lock->tickets.head) : : "memory");
+ else
+ asm (LOCK_PREFIX "incw %0"
+ : "+m" (lock->tickets.head) : : "memory");
+
+}
#else
-# define UNLOCK_LOCK_PREFIX
+static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
+{
+ lock->tickets.head++;
+}
#endif

/*
@@ -93,14 +105,6 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)

return tmp;
}
-
-static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
-{
- asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
- : "+m" (lock->slock)
- :
- : "memory", "cc");
-}
#else
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
@@ -144,15 +148,13 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)

return tmp;
}
+#endif

static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
{
- asm volatile(UNLOCK_LOCK_PREFIX "incw %0"
- : "+m" (lock->slock)
- :
- : "memory", "cc");
+ __ticket_unlock_release(lock);
+ barrier(); /* prevent reordering into locked region */
}
-#endif

static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
{
--
1.7.5.4

2011-06-16 21:41:09

by Jeremy Fitzhardinge

Subject: [PATCH 4/7] x86/ticketlock: make large and small ticket versions of spin_lock the same

From: Jeremy Fitzhardinge <[email protected]>

Make the bulk of __ticket_spin_lock look identical for large and small
number of cpus.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 23 ++++++++---------------
1 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0170ba9..1b81809 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -70,19 +70,16 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
#if (NR_CPUS < 256)
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- register union {
- struct __raw_tickets tickets;
- unsigned short slock;
- } inc = { .slock = 1 << TICKET_SHIFT };
+ register struct __raw_tickets inc = { .tail = 1 };

asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
- : "+Q" (inc), "+m" (lock->slock) : : "memory", "cc");
+ : "+r" (inc), "+m" (lock->tickets) : : "memory", "cc");

for (;;) {
- if (inc.tickets.head == inc.tickets.tail)
+ if (inc.head == inc.tail)
goto out;
cpu_relax();
- inc.tickets.head = ACCESS_ONCE(lock->tickets.head);
+ inc.head = ACCESS_ONCE(lock->tickets.head);
}
out: barrier(); /* make sure nothing creeps before the lock is taken */
}
@@ -108,21 +105,17 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
#else
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- unsigned inc = 1 << TICKET_SHIFT;
- __ticket_t tmp;
+ register struct __raw_tickets inc = { .tail = 1 };

asm volatile(LOCK_PREFIX "xaddl %0, %1\n\t"
- : "+r" (inc), "+m" (lock->slock)
+ : "+r" (inc), "+m" (lock->tickets)
: : "memory", "cc");

- tmp = inc;
- inc >>= TICKET_SHIFT;
-
for (;;) {
- if ((__ticket_t)inc == tmp)
+ if (inc.head == inc.tail)
goto out;
cpu_relax();
- tmp = ACCESS_ONCE(lock->tickets.head);
+ inc.head = ACCESS_ONCE(lock->tickets.head);
}
out: barrier(); /* make sure nothing creeps before the lock is taken */
}
--
1.7.5.4

2011-06-16 21:42:29

by Jeremy Fitzhardinge

Subject: [PATCH 5/7] x86/ticketlock: make __ticket_spin_lock common

From: Jeremy Fitzhardinge <[email protected]>

Aside from the particular form of the xadd instruction, the large- and
small-ticket versions of __ticket_spin_lock are identical.
So factor out the xadd and use common code for the rest.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 42 ++++++++++++++++++--------------------
1 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 1b81809..f722f96 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -67,13 +67,27 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
* save some instructions and make the code more elegant. There really isn't
* much between them in performance though, especially as locks are out of line.
*/
-#if (NR_CPUS < 256)
-static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
+static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spinlock *lock)
{
- register struct __raw_tickets inc = { .tail = 1 };
+ register struct __raw_tickets tickets = { .tail = 1 };
+
+ if (sizeof(lock->tickets.head) == sizeof(u8))
+ asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
+ : "+r" (tickets), "+m" (lock->tickets)
+ : : "memory", "cc");
+ else
+ asm volatile (LOCK_PREFIX "xaddl %0, %1\n"
+ : "+r" (tickets), "+m" (lock->tickets)
+ : : "memory", "cc");

- asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
- : "+r" (inc), "+m" (lock->tickets) : : "memory", "cc");
+ return tickets;
+}
+
+static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
+{
+ register struct __raw_tickets inc;
+
+ inc = __ticket_spin_claim(lock);

for (;;) {
if (inc.head == inc.tail)
@@ -84,6 +98,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
out: barrier(); /* make sure nothing creeps before the lock is taken */
}

+#if (NR_CPUS < 256)
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
unsigned int tmp, new;
@@ -103,23 +118,6 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
return tmp;
}
#else
-static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
-{
- register struct __raw_tickets inc = { .tail = 1 };
-
- asm volatile(LOCK_PREFIX "xaddl %0, %1\n\t"
- : "+r" (inc), "+m" (lock->tickets)
- : : "memory", "cc");
-
- for (;;) {
- if (inc.head == inc.tail)
- goto out;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
-out: barrier(); /* make sure nothing creeps before the lock is taken */
-}
-
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
unsigned tmp;
--
1.7.5.4

2011-06-16 21:42:01

by Jeremy Fitzhardinge

Subject: [PATCH 6/7] x86/ticketlock: make __ticket_spin_trylock common

From: Jeremy Fitzhardinge <[email protected]>

Make trylock code common regardless of ticket size.
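
The idea, sketched in userspace C (x86, little-endian, 8-bit tickets;
__sync_bool_compare_and_swap stands in for the kernel's cmpxchg, and the
ticket_trylock name is purely illustrative):

#include <stdint.h>

union ticketlock {
    uint16_t head_tail;                 /* both tickets viewed as one word */
    struct { uint8_t head, tail; } tickets;
};

static inline int ticket_trylock(union ticketlock *lock)
{
    union ticketlock old, new;

    old.head_tail = *(volatile uint16_t *)&lock->head_tail;
    if (old.tickets.head != old.tickets.tail)
        return 0;                       /* already held or queued, don't even try */

    new.head_tail = old.head_tail + (1 << 8);   /* grab the next tail */

    /* the compare-and-swap is a full barrier and only succeeds if nobody raced us */
    return __sync_bool_compare_and_swap(&lock->head_tail,
                                        old.head_tail, new.head_tail);
}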

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 49 +++++++--------------------------
arch/x86/include/asm/spinlock_types.h | 6 +++-
2 files changed, 14 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index f722f96..3afb1a7 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -98,48 +98,19 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
out: barrier(); /* make sure nothing creeps before the lock is taken */
}

-#if (NR_CPUS < 256)
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
- unsigned int tmp, new;
-
- asm volatile("movzwl %2, %0\n\t"
- "cmpb %h0,%b0\n\t"
- "leal 0x100(%" REG_PTR_MODE "0), %1\n\t"
- "jne 1f\n\t"
- LOCK_PREFIX "cmpxchgw %w1,%2\n\t"
- "1:"
- "sete %b1\n\t"
- "movzbl %b1,%0\n\t"
- : "=&a" (tmp), "=&q" (new), "+m" (lock->slock)
- :
- : "memory", "cc");
-
- return tmp;
-}
-#else
-static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
-{
- unsigned tmp;
- unsigned new;
-
- asm volatile("movl %2,%0\n\t"
- "movl %0,%1\n\t"
- "roll $16, %0\n\t"
- "cmpl %0,%1\n\t"
- "leal 0x00010000(%" REG_PTR_MODE "0), %1\n\t"
- "jne 1f\n\t"
- LOCK_PREFIX "cmpxchgl %1,%2\n\t"
- "1:"
- "sete %b1\n\t"
- "movzbl %b1,%0\n\t"
- : "=&a" (tmp), "=&q" (new), "+m" (lock->slock)
- :
- : "memory", "cc");
-
- return tmp;
+ arch_spinlock_t old, new;
+
+ old.tickets = ACCESS_ONCE(lock->tickets);
+ if (old.tickets.head != old.tickets.tail)
+ return 0;
+
+ new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
+
+ /* cmpxchg is a full barrier, so nothing can move before it */
+ return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
}
-#endif

static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
{
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index e3ad1e3..72e154e 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -9,8 +9,10 @@

#if (CONFIG_NR_CPUS < 256)
typedef u8 __ticket_t;
+typedef u16 __ticketpair_t;
#else
typedef u16 __ticket_t;
+typedef u32 __ticketpair_t;
#endif

#define TICKET_SHIFT (sizeof(__ticket_t) * 8)
@@ -18,14 +20,14 @@ typedef u16 __ticket_t;

typedef struct arch_spinlock {
union {
- unsigned int slock;
+ __ticketpair_t head_tail;
struct __raw_tickets {
__ticket_t head, tail;
} tickets;
};
} arch_spinlock_t;

-#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } }
+#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }

typedef struct {
unsigned int lock;
--
1.7.5.4

2011-06-16 21:41:59

by Jeremy Fitzhardinge

Subject: [PATCH 7/7] x86/ticketlock: prevent memory accesses from reordered out of lock region

From: Jeremy Fitzhardinge <[email protected]>

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3afb1a7..dac6fc6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -114,6 +114,7 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)

static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
{
+ barrier(); /* prevent reordering out of locked region */
__ticket_unlock_release(lock);
barrier(); /* prevent reordering into locked region */
}
--
1.7.5.4

2011-06-21 14:03:40

by Peter Zijlstra

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On Thu, 2011-06-16 at 14:40 -0700, Jeremy Fitzhardinge wrote:
> From: Jeremy Fitzhardinge <[email protected]>
>
> Hi all,
>
> I'm proposing this series for 3[.0].1.
>
> This is a repost of a series to clean up the x86 ticket lock
> code by converting it to a mostly C implementation and removing
> lots of duplicate code relating to the ticket size.
>
> The last time I posted this series, the only significant comments
> were from Nick Piggin, specifically relating to:
>
> 1. A wrongly placed barrier on unlock (which may have allowed the
> compiler to move things out of the locked region). I went
> belt-and-suspenders by having two barriers to prevent motion
> into or out of the locked region.
>
> 2. With NR_CPUS < 256 the ticket size is 8 bits. The compiler doesn't
> use the same trick as the hand-coded asm to directly compare the high
> and low bytes in the word, but does a bit of extra shuffling around.
> However, the Intel optimisation guide and several x86 experts have
> opined that it's best to avoid the high-byte operations anyway, since
> they will cause a partial word stall, and the gcc-generated code should
> be better.
>
> Overall the compiler-generated code is very similar to the hand-coded
> versions, with the partial byte operations being the only significant
> difference. (Curiously, gcc does generate a high-byte compare for me
> in trylock, so it can if it wants to.)
>
> I've been running with this code in place for several months on 4 core
> systems without any problems.
>
> I couldn't measure a consistent performance difference between the two
> implementations; there seemed to be about +/- 1%, which is the level of
> variation I see from simply recompiling the kernel with slightly
> different code alignment.
>
> Overall, I think the large reduction in code size is a big win.

No complaints from me, I rather like the result.

One other thing you could contemplate is adding something like:

#define xadd(ptr, inc) \
do { \
switch(sizeof(*(ptr))) { \
case 1: \
asm volatile (LOCK_PREFIX "xaddb %0, %1\n" \
: "+r" (inc), "+m" (*(ptr)) \
: : "memory", "cc"); \
case 2:
... xaddw ...
case 4:
... xaddl ...
} while (0)

and a similar something for inc. For both there seem to be various asm
bits left (we could even consider adding xadd to
arch/x86/include/asm/cmpxchg*.h).

Also, you might have wanted to CC Linus on this, he's usually interested
in these bits.

2011-06-21 14:22:04

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On Thu, Jun 16, 2011 at 02:40:47PM -0700, Jeremy Fitzhardinge wrote:
> From: Jeremy Fitzhardinge <[email protected]>
>
> Hi all,
>
> I'm proposing this series for 3[.0].1.

Is there a git branch with these patches? I've looked over them
and they look good (and they make the code much easier to read)
but I would like to ingest them in my branches for some extra
testing goodness.

2011-06-21 18:03:05

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/21/2011 07:01 AM, Peter Zijlstra wrote:
> No complaints from me, I rather like the result.

OK, good.

> One other thing you could contemplate is adding something like:
>
> #define xadd(ptr, inc) \
> do { \
> switch(sizeof(*(ptr))) { \
> case 1: \
> asm volatile (LOCK_PREFIX "xaddb %0, %1\n" \
> : "+r" (inc), "+m" (*(ptr)) \
> : : "memory", "cc"); \
> case 2:
> ... xaddw ...
> case 4:
> ... xaddl ...
> } while (0)
>
> and a similar something for inc. For both there seem to be various asm
> bits left (we could even consider adding xadd to
> arch/x86/include/asm/cmpxchg*.h).
>
> Also, you might have wanted to CC Linus on this, he's usually interested
> in these bits.

OK, I'll add the xadd/inc helpers and repost.

J

2011-06-22 19:40:20

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/21/2011 07:01 AM, Peter Zijlstra wrote:
> One other thing you could contemplate is adding something like:
>
> #define xadd(ptr, inc) \
> do { \
> switch(sizeof(*(ptr))) { \
> case 1: \
> asm volatile (LOCK_PREFIX "xaddb %0, %1\n" \
> : "+r" (inc), "+m" (*(ptr)) \
> : : "memory", "cc"); \
> case 2:
> ... xaddw ...
> case 4:
> ... xaddl ...
> } while (0)
>
> and a similar something for inc. For both there seem to be various asm
> bits left (we could even consider adding xadd to
> arch/x86/include/asm/cmpxchg*.h).

A friend just pointed out that gcc has a "__sync_fetch_and_add()"
intrinsic, which compiles to xadd when used in this context. What's the
general feeling about using these kinds of gcc features?

It also has __sync_bool_compare_and_swap(), which would simplify a lot
of asm/cmpxchg.h...
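
For example, a ticket claim written with the intrinsic (a userspace sketch;
claim_ticket is a made-up name, and the 1 << 8 assumes 8-bit tickets with the
tail in the high byte on little-endian x86):

static inline unsigned short claim_ticket(volatile unsigned short *slock)
{
    /* the old value is used here, so this is where gcc emits a "lock xaddw" */
    return __sync_fetch_and_add(slock, 1 << 8);
}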

J

2011-06-22 20:20:19

by H. Peter Anvin

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/22/2011 12:21 PM, Jeremy Fitzhardinge wrote:
>
> A friend just pointed out that gcc has a "__sync_fetch_and_add()"
> intrinsic, which compiles to xadd when used in this context. What's the
> general feeling about using these kinds of gcc features?
>

In general they are good, IF:

a) they cover all versions of gcc we care about (or we have a fallback),
and
b) they have the right semantics.

Using gcc intrinsics can generate better code than we can in inline
assembly.

However, (b) is a killer since gcc doesn't have a way to generate our
lock prefix annotations, and so it isn't really useful here.

-hpa

2011-06-22 20:59:44

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/22/2011 01:19 PM, H. Peter Anvin wrote:
> On 06/22/2011 12:21 PM, Jeremy Fitzhardinge wrote:
>> A friend just pointed out that gcc has a "__sync_fetch_and_add()"
>> intrinsic, which compiles to xadd when used in this context. What's the
>> general feeling about using these kinds of gcc features?
>>
> In general they are good, IF:
>
> a) they cover all versions of gcc we care about (or we have a fallback),

What is the supported range for these days?

> and
> b) they have the right semantics.

My main concern was making sure that it's a strong enough barrier, but
the documentation is pretty explicit about that.

> Using gcc intrinsics can generate better code than we can in inline
> assembly.

It does seem to do a pretty good job; it generates a plain locked add if
you never look at the returned value, for example.
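
e.g. (illustrating the point above; bump() is just a made-up example, and the
exact instruction selection depends on the gcc version):

void bump(unsigned int *count)
{
    /* the result is discarded, so gcc is free to use a plain
     * "lock add" here rather than a lock xadd */
    __sync_fetch_and_add(count, 1);
}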

> However, (b) is a killer since gcc doesn't have a way to generate our
> lock prefix annotations, and so it isn't really useful here.

Yeah, I thought about that. It's a bit unfortunate we're getting into
spinlock code at all on a UP system, but we don't have a mechanism to
stomp locking at a higher level. (Ignoring all the insane stuff that
happens when the system becomes UP transiently just because all the
other CPUs have been unplugged for suspend, etc; we just shouldn't
bother in that case.)

J

2011-06-22 21:08:37

by H. Peter Anvin

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/22/2011 01:59 PM, Jeremy Fitzhardinge wrote:
> On 06/22/2011 01:19 PM, H. Peter Anvin wrote:
>> On 06/22/2011 12:21 PM, Jeremy Fitzhardinge wrote:
>>> A friend just pointed out that gcc has a "__sync_fetch_and_add()"
>>> intrinsic, which compiles to xadd when used in this context. What's the
>>> general feeling about using these kinds of gcc features?
>>>
>> In general they are good, IF:
>>
>> a) they cover all versions of gcc we care about (or we have a fallback),
>
> What is the supported range for these days?
>

For x86, we support 3.4, 4.0 and 4.1.2 and above (not sure if 4.0.x
actually works). Other architectures have different rules. At some
point we'll probably drop anything below 4.1.2, but we apparently can't yet.

>> and
>> b) they have the right semantics.
>
> My main concern was making sure that its a strong enough barrier, but
> the documentation is pretty explicit about that.
>
>> Using gcc intrinsics can generate better code than we can in inline
>> assembly.
>
> It does seem to do a pretty good job; it generates a plain locked add if
> you never look at the returned value, for example.
>
>> However, (b) is a killer since gcc doesn't have a way to generate our
>> lock prefix annotations, and so it isn't really useful here.
>
> Yeah, I thought about that. It's a bit unfortunate we're getting into
> spinlock code at all on a UP system, but we don't have a mechanism to
> stomp locking at a higher level. (Ignoring all the insane stuff that
> happens when the system becomes UP transiently just because all the
> other CPUs have been unplugged for suspend, etc; we just shouldn't
> bother in that case.)

Yep...

-hpa

2011-06-22 21:35:53

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/22/2011 02:07 PM, H. Peter Anvin wrote:
> For x86, we support 3.4, 4.0 and 4.1.2 and above (not sure if 4.0.x
> actually works). Other architectures have different rules. At some
> point we'll probably drop anything below 4.1.2, but we apparently can't yet.

Looks like the intrinsics turned up in 4.1.0.

J

2011-06-22 23:17:43

by H. Peter Anvin

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/22/2011 02:35 PM, Jeremy Fitzhardinge wrote:
> On 06/22/2011 02:07 PM, H. Peter Anvin wrote:
>> For x86, we support 3.4, 4.0 and 4.1.2 and above (not sure if 4.0.x
>> actually works). Other architectures have different rules. At some
>> point we'll probably drop anything below 4.1.2, but we apparently can't yet.
>
> Looks like the intrinsics turned up in 4.1.0.
>

Yes, but they still don't do what we want...

-hpa

2011-06-24 01:19:27

by Jeremy Fitzhardinge

Subject: [PATCH 1/8] x86/ticketlock: clean up types and accessors

From: Jeremy Fitzhardinge <[email protected]>

A few cleanups to the way spinlocks are defined and accessed:
- define __ticket_t which is the size of a spinlock ticket (ie, enough
bits to hold all the cpus)
- Define struct arch_spinlock as a union containing plain slock and
the head and tail tickets
- Use head and tail to implement some of the spinlock predicates.
- Make all ticket variables unsigned.
- Use TICKET_SHIFT to form constants

Most of this will be used in later patches.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 24 ++++++++++--------------
arch/x86/include/asm/spinlock_types.h | 20 ++++++++++++++++++--
2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3089f70..d6d5784 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -56,11 +56,9 @@
* much between them in performance though, especially as locks are out of line.
*/
#if (NR_CPUS < 256)
-#define TICKET_SHIFT 8
-
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- short inc = 0x0100;
+ unsigned short inc = 1 << TICKET_SHIFT;

asm volatile (
LOCK_PREFIX "xaddw %w0, %1\n"
@@ -79,7 +77,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
- int tmp, new;
+ unsigned int tmp, new;

asm volatile("movzwl %2, %0\n\t"
"cmpb %h0,%b0\n\t"
@@ -104,12 +102,10 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
: "memory", "cc");
}
#else
-#define TICKET_SHIFT 16
-
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- int inc = 0x00010000;
- int tmp;
+ unsigned inc = 1 << TICKET_SHIFT;
+ unsigned tmp;

asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
"movzwl %w0, %2\n\t"
@@ -129,8 +125,8 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
- int tmp;
- int new;
+ unsigned tmp;
+ unsigned new;

asm volatile("movl %2,%0\n\t"
"movl %0,%1\n\t"
@@ -160,16 +156,16 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)

static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
{
- int tmp = ACCESS_ONCE(lock->slock);
+ struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);

- return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+ return !!(tmp.tail ^ tmp.head);
}

static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
{
- int tmp = ACCESS_ONCE(lock->slock);
+ struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);

- return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+ return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
}

#ifndef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index dcb48b2..e3ad1e3 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -5,11 +5,27 @@
# error "please don't include this file directly"
#endif

+#include <linux/types.h>
+
+#if (CONFIG_NR_CPUS < 256)
+typedef u8 __ticket_t;
+#else
+typedef u16 __ticket_t;
+#endif
+
+#define TICKET_SHIFT (sizeof(__ticket_t) * 8)
+#define TICKET_MASK ((__ticket_t)((1 << TICKET_SHIFT) - 1))
+
typedef struct arch_spinlock {
- unsigned int slock;
+ union {
+ unsigned int slock;
+ struct __raw_tickets {
+ __ticket_t head, tail;
+ } tickets;
+ };
} arch_spinlock_t;

-#define __ARCH_SPIN_LOCK_UNLOCKED { 0 }
+#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } }

typedef struct {
unsigned int lock;
--
1.7.5.4

2011-06-24 01:19:28

by Jeremy Fitzhardinge

Subject: [PATCH 2/8] x86/ticketlock: convert spin loop to C

From: Jeremy Fitzhardinge <[email protected]>

The inner loop of __ticket_spin_lock isn't doing anything very special,
so reimplement it in C.

For the 8 bit ticket lock variant, we use a register union to get direct
access to the lower and upper bytes in the tickets, but unfortunately gcc
won't generate a direct comparison between the two halves of the register,
so the generated asm isn't quite as pretty as the hand-coded version.
However, benchmarking shows that this is actually a small improvement in
runtime performance on some benchmarks, and never a slowdown.

We also need to make sure there's a barrier at the end of the lock loop
to make sure that the compiler doesn't move any instructions from within
the locked region into the region where we don't yet own the lock.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 58 +++++++++++++++++++-------------------
1 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index d6d5784..f48a6e3 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -58,21 +58,21 @@
#if (NR_CPUS < 256)
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- unsigned short inc = 1 << TICKET_SHIFT;
-
- asm volatile (
- LOCK_PREFIX "xaddw %w0, %1\n"
- "1:\t"
- "cmpb %h0, %b0\n\t"
- "je 2f\n\t"
- "rep ; nop\n\t"
- "movb %1, %b0\n\t"
- /* don't need lfence here, because loads are in-order */
- "jmp 1b\n"
- "2:"
- : "+Q" (inc), "+m" (lock->slock)
- :
- : "memory", "cc");
+ register union {
+ struct __raw_tickets tickets;
+ unsigned short slock;
+ } inc = { .slock = 1 << TICKET_SHIFT };
+
+ asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
+ : "+Q" (inc), "+m" (lock->slock) : : "memory", "cc");
+
+ for (;;) {
+ if (inc.tickets.head == inc.tickets.tail)
+ goto out;
+ cpu_relax();
+ inc.tickets.head = ACCESS_ONCE(lock->tickets.head);
+ }
+out: barrier(); /* make sure nothing creeps before the lock is taken */
}

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
@@ -105,22 +105,22 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
unsigned inc = 1 << TICKET_SHIFT;
- unsigned tmp;
+ __ticket_t tmp;

- asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
- "movzwl %w0, %2\n\t"
- "shrl $16, %0\n\t"
- "1:\t"
- "cmpl %0, %2\n\t"
- "je 2f\n\t"
- "rep ; nop\n\t"
- "movzwl %1, %2\n\t"
- /* don't need lfence here, because loads are in-order */
- "jmp 1b\n"
- "2:"
- : "+r" (inc), "+m" (lock->slock), "=&r" (tmp)
- :
- : "memory", "cc");
+ asm volatile(LOCK_PREFIX "xaddl %0, %1\n\t"
+ : "+r" (inc), "+m" (lock->slock)
+ : : "memory", "cc");
+
+ tmp = inc;
+ inc >>= TICKET_SHIFT;
+
+ for (;;) {
+ if ((__ticket_t)inc == tmp)
+ goto out;
+ cpu_relax();
+ tmp = ACCESS_ONCE(lock->tickets.head);
+ }
+out: barrier(); /* make sure nothing creeps before the lock is taken */
}

static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
--
1.7.5.4

2011-06-24 01:19:29

by Jeremy Fitzhardinge

Subject: [PATCH 3/8] x86/ticketlock: Use C for __ticket_spin_unlock

From: Jeremy Fitzhardinge <[email protected]>

If we don't need to use a locked inc for unlock, then implement it in C.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 33 ++++++++++++++++++---------------
1 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index f48a6e3..704b0c3 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -33,9 +33,21 @@
* On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
* (PPro errata 66, 92)
*/
-# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
+static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
+{
+ if (sizeof(lock->tickets.head) == sizeof(u8))
+ asm (LOCK_PREFIX "incb %0"
+ : "+m" (lock->tickets.head) : : "memory");
+ else
+ asm (LOCK_PREFIX "incw %0"
+ : "+m" (lock->tickets.head) : : "memory");
+
+}
#else
-# define UNLOCK_LOCK_PREFIX
+static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
+{
+ lock->tickets.head++;
+}
#endif

/*
@@ -93,14 +105,6 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)

return tmp;
}
-
-static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
-{
- asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
- : "+m" (lock->slock)
- :
- : "memory", "cc");
-}
#else
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
@@ -144,15 +148,14 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)

return tmp;
}
+#endif

static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
{
- asm volatile(UNLOCK_LOCK_PREFIX "incw %0"
- : "+m" (lock->slock)
- :
- : "memory", "cc");
+ barrier(); /* prevent reordering out of locked region */
+ __ticket_unlock_release(lock);
+ barrier(); /* prevent reordering into locked region */
}
-#endif

static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
{
--
1.7.5.4

2011-06-24 01:19:30

by Jeremy Fitzhardinge

Subject: [PATCH 4/8] x86/ticketlock: make large and small ticket versions of spin_lock the same

From: Jeremy Fitzhardinge <[email protected]>

Make the bulk of __ticket_spin_lock look identical for large and small
number of cpus.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 23 ++++++++---------------
1 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 704b0c3..694b8a0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -70,19 +70,16 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
#if (NR_CPUS < 256)
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- register union {
- struct __raw_tickets tickets;
- unsigned short slock;
- } inc = { .slock = 1 << TICKET_SHIFT };
+ register struct __raw_tickets inc = { .tail = 1 };

asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
- : "+Q" (inc), "+m" (lock->slock) : : "memory", "cc");
+ : "+r" (inc), "+m" (lock->tickets) : : "memory", "cc");

for (;;) {
- if (inc.tickets.head == inc.tickets.tail)
+ if (inc.head == inc.tail)
goto out;
cpu_relax();
- inc.tickets.head = ACCESS_ONCE(lock->tickets.head);
+ inc.head = ACCESS_ONCE(lock->tickets.head);
}
out: barrier(); /* make sure nothing creeps before the lock is taken */
}
@@ -108,21 +105,17 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
#else
static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
- unsigned inc = 1 << TICKET_SHIFT;
- __ticket_t tmp;
+ register struct __raw_tickets inc = { .tail = 1 };

asm volatile(LOCK_PREFIX "xaddl %0, %1\n\t"
- : "+r" (inc), "+m" (lock->slock)
+ : "+r" (inc), "+m" (lock->tickets)
: : "memory", "cc");

- tmp = inc;
- inc >>= TICKET_SHIFT;
-
for (;;) {
- if ((__ticket_t)inc == tmp)
+ if (inc.head == inc.tail)
goto out;
cpu_relax();
- tmp = ACCESS_ONCE(lock->tickets.head);
+ inc.head = ACCESS_ONCE(lock->tickets.head);
}
out: barrier(); /* make sure nothing creeps before the lock is taken */
}
--
1.7.5.4

2011-06-24 01:19:31

by Jeremy Fitzhardinge

Subject: [PATCH 5/8] x86/ticketlock: make __ticket_spin_lock common

From: Jeremy Fitzhardinge <[email protected]>

Aside from the particular form of the xadd instruction, the large- and
small-ticket versions of __ticket_spin_lock are identical.
So factor out the xadd and use common code for the rest.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 42 ++++++++++++++++++--------------------
1 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 694b8a0..0b8d2b2 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -67,13 +67,27 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
* save some instructions and make the code more elegant. There really isn't
* much between them in performance though, especially as locks are out of line.
*/
-#if (NR_CPUS < 256)
-static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
+static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spinlock *lock)
{
- register struct __raw_tickets inc = { .tail = 1 };
+ register struct __raw_tickets tickets = { .tail = 1 };
+
+ if (sizeof(lock->tickets.head) == sizeof(u8))
+ asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
+ : "+r" (tickets), "+m" (lock->tickets)
+ : : "memory", "cc");
+ else
+ asm volatile (LOCK_PREFIX "xaddl %0, %1\n"
+ : "+r" (tickets), "+m" (lock->tickets)
+ : : "memory", "cc");

- asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
- : "+r" (inc), "+m" (lock->tickets) : : "memory", "cc");
+ return tickets;
+}
+
+static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
+{
+ register struct __raw_tickets inc;
+
+ inc = __ticket_spin_claim(lock);

for (;;) {
if (inc.head == inc.tail)
@@ -84,6 +98,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
out: barrier(); /* make sure nothing creeps before the lock is taken */
}

+#if (NR_CPUS < 256)
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
unsigned int tmp, new;
@@ -103,23 +118,6 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
return tmp;
}
#else
-static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
-{
- register struct __raw_tickets inc = { .tail = 1 };
-
- asm volatile(LOCK_PREFIX "xaddl %0, %1\n\t"
- : "+r" (inc), "+m" (lock->tickets)
- : : "memory", "cc");
-
- for (;;) {
- if (inc.head == inc.tail)
- goto out;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
-out: barrier(); /* make sure nothing creeps before the lock is taken */
-}
-
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
unsigned tmp;
--
1.7.5.4

2011-06-24 01:20:54

by Jeremy Fitzhardinge

Subject: [PATCH 6/8] x86/ticketlock: make __ticket_spin_trylock common

From: Jeremy Fitzhardinge <[email protected]>

Make trylock code common regardless of ticket size.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 49 +++++++--------------------------
arch/x86/include/asm/spinlock_types.h | 6 +++-
2 files changed, 14 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0b8d2b2..dac6fc6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -98,48 +98,19 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
out: barrier(); /* make sure nothing creeps before the lock is taken */
}

-#if (NR_CPUS < 256)
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
- unsigned int tmp, new;
-
- asm volatile("movzwl %2, %0\n\t"
- "cmpb %h0,%b0\n\t"
- "leal 0x100(%" REG_PTR_MODE "0), %1\n\t"
- "jne 1f\n\t"
- LOCK_PREFIX "cmpxchgw %w1,%2\n\t"
- "1:"
- "sete %b1\n\t"
- "movzbl %b1,%0\n\t"
- : "=&a" (tmp), "=&q" (new), "+m" (lock->slock)
- :
- : "memory", "cc");
-
- return tmp;
-}
-#else
-static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
-{
- unsigned tmp;
- unsigned new;
-
- asm volatile("movl %2,%0\n\t"
- "movl %0,%1\n\t"
- "roll $16, %0\n\t"
- "cmpl %0,%1\n\t"
- "leal 0x00010000(%" REG_PTR_MODE "0), %1\n\t"
- "jne 1f\n\t"
- LOCK_PREFIX "cmpxchgl %1,%2\n\t"
- "1:"
- "sete %b1\n\t"
- "movzbl %b1,%0\n\t"
- : "=&a" (tmp), "=&q" (new), "+m" (lock->slock)
- :
- : "memory", "cc");
-
- return tmp;
+ arch_spinlock_t old, new;
+
+ old.tickets = ACCESS_ONCE(lock->tickets);
+ if (old.tickets.head != old.tickets.tail)
+ return 0;
+
+ new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
+
+ /* cmpxchg is a full barrier, so nothing can move before it */
+ return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
}
-#endif

static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
{
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index e3ad1e3..72e154e 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -9,8 +9,10 @@

#if (CONFIG_NR_CPUS < 256)
typedef u8 __ticket_t;
+typedef u16 __ticketpair_t;
#else
typedef u16 __ticket_t;
+typedef u32 __ticketpair_t;
#endif

#define TICKET_SHIFT (sizeof(__ticket_t) * 8)
@@ -18,14 +20,14 @@ typedef u16 __ticket_t;

typedef struct arch_spinlock {
union {
- unsigned int slock;
+ __ticketpair_t head_tail;
struct __raw_tickets {
__ticket_t head, tail;
} tickets;
};
} arch_spinlock_t;

-#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } }
+#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }

typedef struct {
unsigned int lock;
--
1.7.5.4

2011-06-24 01:20:58

by Jeremy Fitzhardinge

Subject: [PATCH 7/8] x86: add xadd helper macro

From: Jeremy Fitzhardinge <[email protected]>

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/cmpxchg_32.h | 21 +++++++++++++++++++++
arch/x86/include/asm/cmpxchg_64.h | 26 ++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index 284a6e8..30f0318 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -280,4 +280,25 @@ static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,

#endif

+#define xadd(ptr, inc) \
+ do { \
+ switch (sizeof(*(ptr))) { \
+ case 1: \
+ asm volatile (LOCK_PREFIX "xaddb %b0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ case 2: \
+ asm volatile (LOCK_PREFIX "xaddw %w0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ case 4: \
+ asm volatile (LOCK_PREFIX "xaddl %0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ } \
+ } while(0)
+
#endif /* _ASM_X86_CMPXCHG_32_H */
diff --git a/arch/x86/include/asm/cmpxchg_64.h b/arch/x86/include/asm/cmpxchg_64.h
index 423ae58..62da1ff 100644
--- a/arch/x86/include/asm/cmpxchg_64.h
+++ b/arch/x86/include/asm/cmpxchg_64.h
@@ -151,4 +151,30 @@ extern void __cmpxchg_wrong_size(void);
cmpxchg_local((ptr), (o), (n)); \
})

+#define xadd(ptr, inc) \
+ do { \
+ switch (sizeof(*(ptr))) { \
+ case 1: \
+ asm volatile (LOCK_PREFIX "xaddb %b0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ case 2: \
+ asm volatile (LOCK_PREFIX "xaddw %w0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ case 4: \
+ asm volatile (LOCK_PREFIX "xaddl %0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ case 8: \
+ asm volatile (LOCK_PREFIX "xaddq %q0, %1\n" \
+ : "+r" (inc), "+m" (*(ptr)) \
+ : : "memory", "cc"); \
+ break; \
+ } \
+ } while(0)
+
#endif /* _ASM_X86_CMPXCHG_64_H */
--
1.7.5.4

2011-06-24 01:20:56

by Jeremy Fitzhardinge

Subject: [PATCH 8/8] x86/ticketlock: use xadd helper

From: Jeremy Fitzhardinge <[email protected]>

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/spinlock.h | 9 +--------
1 files changed, 1 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index dac6fc6..2d14e7c 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -71,14 +71,7 @@ static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spin
{
register struct __raw_tickets tickets = { .tail = 1 };

- if (sizeof(lock->tickets.head) == sizeof(u8))
- asm volatile (LOCK_PREFIX "xaddw %w0, %1\n"
- : "+r" (tickets), "+m" (lock->tickets)
- : : "memory", "cc");
- else
- asm volatile (LOCK_PREFIX "xaddl %0, %1\n"
- : "+r" (tickets), "+m" (lock->tickets)
- : : "memory", "cc");
+ xadd(&lock->tickets, tickets);

return tickets;
}
--
1.7.5.4

2011-06-24 21:51:16

by H. Peter Anvin

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/23/2011 06:19 PM, Jeremy Fitzhardinge wrote:
>
> I've been running with this code in place for several months on 4 core
> systems without any problems.
>
> I couldn't measure a consistent performance difference between the two
> implementations; there seemed to be about +/- 1%, which is the level of
> variation I see from simply recompiling the kernel with slightly
> different code alignment.
>
> Overall, I think the large reduction in code size is a big win.
>

Could you give us the delta in *compiled* code size?

-hpa

2011-06-24 21:52:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 3/8] x86/ticketlock: Use C for __ticket_spin_unlock

On 06/23/2011 06:19 PM, Jeremy Fitzhardinge wrote:
> From: Jeremy Fitzhardinge <[email protected]>
>
> If we don't need to use a locked inc for unlock, then implement it in C.
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/include/asm/spinlock.h | 33 ++++++++++++++++++---------------
> 1 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index f48a6e3..704b0c3 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -33,9 +33,21 @@
> * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
> * (PPro errata 66, 92)
> */
> -# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
> +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock)
> +{
> + if (sizeof(lock->tickets.head) == sizeof(u8))
> + asm (LOCK_PREFIX "incb %0"
> + : "+m" (lock->tickets.head) : : "memory");
> + else
> + asm (LOCK_PREFIX "incw %0"
> + : "+m" (lock->tickets.head) : : "memory");
> +

These should be asm volatile, I believe.

-hpa

2011-06-25 02:03:56

by Jeremy Fitzhardinge

Subject: Re: [PATCH 3/8] x86/ticketlock: Use C for __ticket_spin_unlock

On 06/24/2011 02:52 PM, H. Peter Anvin wrote:
> These should be asm volatile, I believe.

OK.

J

2011-06-25 02:03:50

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

On 06/24/2011 02:50 PM, H. Peter Anvin wrote:
> On 06/23/2011 06:19 PM, Jeremy Fitzhardinge wrote:
>> I've been running with this code in place for several months on 4 core
>> systems without any problems.
>>
>> I couldn't measure a consistent performance difference between the two
>> implementations; there seemed to be about +/- 1%, which is the level of
>> variation I see from simply recompiling the kernel with slightly
>> different code alignment.
>>
>> Overall, I think the large reduction in code size is a big win.
>>
> Could you give us the delta in *compiled* code size?

Sure. Do you mean for the individual lock sequences, or for the overall
kernel?

J

2011-06-25 03:16:39

by H. Peter Anvin

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

Jeremy Fitzhardinge <[email protected]> wrote:

> On 06/24/2011 02:50 PM, H. Peter Anvin wrote:
>> On 06/23/2011 06:19 PM, Jeremy Fitzhardinge wrote:
>>> I've been running with this code in place for several months on 4 core
>>> systems without any problems.
>>>
>>> I couldn't measure a consistent performance difference between the two
>>> implementations; there seemed to be about +/- 1%, which is the level of
>>> variation I see from simply recompiling the kernel with slightly
>>> different code alignment.
>>>
>>> Overall, I think the large reduction in code size is a big win.
>>>
>> Could you give us the delta in *compiled* code size?
>
> Sure. Do you mean for the individual lock sequences, or for the overall
> kernel?
>
> J

Overall is fine.
--
Sent from my mobile phone. Please excuse my brevity and lack of formatting.

2011-06-25 10:12:10

by Ingo Molnar

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code


* Jeremy Fitzhardinge <[email protected]> wrote:

> 2. With NR_CPUS < 256 the ticket size is 8 bits. The compiler doesn't
> use the same trick as the hand-coded asm to directly compare the high
> and low bytes in the word, but does a bit of extra shuffling around.
> However, the Intel optimisation guide and several x86 experts have
> opined that it's best to avoid the high-byte operations anyway, since
> they will cause a partial word stall, and the gcc-generated code should
> be better.
>
> Overall the compiler-generated code is very similar to the hand-coded
> versions, with the partial byte operations being the only significant
> difference. (Curiously, gcc does generate a high-byte compare for me
> in trylock, so it can if it wants to.)
>
> I've been running with this code in place for several months on 4 core
> systems without any problems.

Please do measurements both in terms of disassembly based instruction
count(s) in the fastpath(s) (via looking at the before/after
disassembly) and actual cycle, instruction and branch counts (via
perf measurements).

> I couldn't measure a consistent performance difference between the two
> implementations; there seemed to be about +/- 1%, which is the level of
> variation I see from simply recompiling the kernel with slightly
> different code alignment.

Then you've done the micro-cost measurements the wrong way - we can
and do detect much finer effects than 1%, see the methods used in
this commit for example:

c8b281161dfa: sched: Increase SCHED_LOAD_SCALE resolution

Please also ensure that the cold-cache behavior is fairly measured
via hot-cache benchmarks (that is not always guaranteed).

Thanks,

Ingo

2011-06-29 20:45:45

by Andi Kleen

Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code

Jeremy Fitzhardinge <[email protected]> writes:
>
> I couldn't measure a consistent performance difference between the two
> implementations; there seemed to be about +/- 1%, which is the level of
> variation I see from simply recompiling the kernel with slightly
> different code alignment.

I ran your new locks in my lock tester and I have a similar experience.
There's some variation, but it seems to be within the usual variance.
In some cases the C locks were actually faster.

-Andi

--
[email protected] -- Speaking for myself only