2014-04-02 13:28:22

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

N.B. Sorry for the duplicate. This patch series were resent as the
original one was rejected by the vger.kernel.org list server
due to long header. There is no change in content.

v7->v8:
- Remove one unneeded atomic operation from the slowpath, thus
improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.

v6->v7:
- Remove an atomic operation from the 2-task contending code
- Shorten the names of some macros
- Make the queue waiter to attempt to steal lock when unfair lock is
enabled.
- Remove lock holder kick from the PV code and fix a race condition
- Run the unfair lock & PV code on overcommitted KVM guests to collect
performance data.

v5->v6:
- Change the optimized 2-task contending code to make it fairer at the
expense of a bit of performance.
- Add a patch to support unfair queue spinlock for Xen.
- Modify the PV qspinlock code to follow what was done in the PV
ticketlock.
- Add performance data for the unfair lock as well as the PV
support code.

v4->v5:
- Move the optimized 2-task contending code to the generic file to
enable more architectures to use it without code duplication.
- Address some of the style-related comments by PeterZ.
- Allow the use of unfair queue spinlock in a real para-virtualized
execution environment.
- Add para-virtualization support to the qspinlock code by ensuring
that the lock holder and queue head stay alive as much as possible.

v3->v4:
- Remove debugging code and fix a configuration error
- Simplify the qspinlock structure and streamline the code to make it
perform a bit better
- Add an x86 version of asm/qspinlock.h for holding x86 specific
optimization.
- Add an optimized x86 code path for 2 contending tasks to improve
low contention performance.

v2->v3:
- Simplify the code by using numerous mode only without an unfair option.
- Use the latest smp_load_acquire()/smp_store_release() barriers.
- Move the queue spinlock code to kernel/locking.
- Make the use of queue spinlock the default for x86-64 without user
configuration.
- Additional performance tuning.

v1->v2:
- Add some more comments to document what the code does.
- Add a numerous CPU mode to support >= 16K CPUs
- Add a configuration option to allow lock stealing which can further
improve performance in many cases.
- Enable wakeup of queue head CPU at unlock time for non-numerous
CPU mode.

This patch set has 3 different sections:
1) Patches 1-4: Introduces a queue-based spinlock implementation that
can replace the default ticket spinlock without increasing the
size of the spinlock data structure. As a result, critical kernel
data structures that embed spinlock won't increase in size and
break data alignments.
2) Patches 5-6: Enables the use of unfair queue spinlock in a
para-virtualized execution environment. This can resolve some
of the locking related performance issues due to the fact that
the next CPU to get the lock may have been scheduled out for a
period of time.
3) Patches 7-10: Enable qspinlock para-virtualization support
by halting the waiting CPUs after spinning for a certain amount of
time. The unlock code will detect the a sleeping waiter and wake it
up. This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket
spinlock in uncontended case. Its performance can be much better
with moderate to heavy contention. This patch has the potential of
improving the performance of all the workloads that have moderate to
heavy spinlock contention.

The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets, though noticeable performance benefit probably won't
show up in machines with less than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock
contention problems. Those need to be solved by refactoring the code
to make more efficient use of the lock or finer granularity ones. The
main purpose is to make the lock contention problems more tolerable
until someone can spend the time and effort to fix them.

To illustrate the performance benefit of the queue spinlock, the
ebizzy benchmark was run with the -m option in two different computers:

Test machine ticket-lock queue-lock
------------ ----------- ----------
4-socket 40-core 2316 rec/s 2899 rec/s
Westmere-EX (HT off)
2-socket 12-core 2130 rec/s 2176 rec/s
Westmere-EP (HT on)

Waiman Long (10):
qspinlock: A generic 4-byte queue spinlock implementation
qspinlock, x86: Enable x86-64 to use queue spinlock
qspinlock: More optimized code for smaller NR_CPUS
qspinlock: Optimized code path for 2 contending tasks
pvqspinlock, x86: Allow unfair spinlock in a PV guest
pvqspinlock: Enable lock stealing in queue lock waiters
pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
pvqspinlock, x86: Add qspinlock para-virtualization support
pvqspinlock, x86: Enable qspinlock PV support for KVM
pvqspinlock, x86: Enable qspinlock PV support for XEN

arch/x86/Kconfig | 12 +
arch/x86/include/asm/paravirt.h | 17 +-
arch/x86/include/asm/paravirt_types.h | 16 +
arch/x86/include/asm/pvqspinlock.h | 260 +++++++++
arch/x86/include/asm/qspinlock.h | 191 +++++++
arch/x86/include/asm/spinlock.h | 9 +-
arch/x86/include/asm/spinlock_types.h | 4 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/kvm.c | 113 ++++-
arch/x86/kernel/paravirt-spinlocks.c | 36 ++-
arch/x86/xen/spinlock.c | 121 ++++-
include/asm-generic/qspinlock.h | 126 ++++
include/asm-generic/qspinlock_types.h | 63 ++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qspinlock.c | 1010 +++++++++++++++++++++++++++++++++
16 files changed, 1975 insertions(+), 12 deletions(-)
create mode 100644 arch/x86/include/asm/pvqspinlock.h
create mode 100644 arch/x86/include/asm/qspinlock.h
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c


2014-04-02 13:28:28

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 03/10] qspinlock: More optimized code for smaller NR_CPUS

For architectures that support atomic operations on smaller 8 or
16 bits data types. It is possible to simplify the code and produce
slightly better optimized code at the expense of smaller number of
supported CPUs.

The qspinlock code can support up to a maximum of 4M-1 CPUs. With
less than 16K CPUs, it is possible to squeeze the queue code into a
2-byte short word which can be accessed directly as a 16-bit short
data type. This enables the simplification of the queue code exchange
portion of the slowpath code.

This patch introduces a new macro _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
which can now be defined in an architecture specific qspinlock.h header
file to indicate its support for smaller atomic operation data types.
This macro triggers the replacement of some of the generic functions
by more optimized versions.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock.h | 34 +++++++++++-
include/asm-generic/qspinlock.h | 8 ++-
include/asm-generic/qspinlock_types.h | 20 ++++++-
kernel/locking/qspinlock.c | 95 +++++++++++++++++++++++++++++++++
4 files changed, 151 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 44cefee..f058b91 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -8,11 +8,23 @@
#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS

/*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET 0
+
+/*
* x86-64 specific queue spinlock union structure
+ * Besides the slock and lock fields, the other fields are only
+ * valid with less than 16K CPUs.
*/
union arch_qspinlock {
struct qspinlock slock;
- u8 lock; /* Lock bit */
+ struct {
+ u8 lock; /* Lock bit */
+ u8 reserved;
+ u16 qcode; /* Queue code */
+ };
+ u32 qlcode; /* Complete lock word */
};

#define queue_spin_unlock queue_spin_unlock
@@ -34,6 +46,26 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
barrier();
}

+#ifdef _QCODE_SHORT
+#define __queue_spin_trylock __queue_spin_trylock
+/**
+ * __queue_spin_trylock - acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: Always return 1
+ *
+ * This routine should only be called when the caller is the only one
+ * entitled to acquire the lock. No lock stealing is allowed.
+ */
+static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ barrier();
+ ACCESS_ONCE(qlock->lock) = _QLOCK_LOCKED;
+ barrier();
+ return 1;
+}
+#endif /* _QCODE_SHORT */
#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */

#include <asm-generic/qspinlock.h>
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 8525931..f47d19e 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -32,17 +32,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
*/
static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
{
- return atomic_read(&lock->qlcode) & _QLOCK_LOCK_MASK;
+ return atomic_read(&lock->qlcode);
}

/**
* queue_spin_value_unlocked - is the spinlock structure unlocked?
* @lock: queue spinlock structure
* Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ * locked wrt the lockref code to avoid lock stealing by the lockref
+ * code and change things underneath the lock.
*/
static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
{
- return !(atomic_read(&lock.qlcode) & _QLOCK_LOCK_MASK);
+ return !atomic_read(&lock.qlcode);
}

/**
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index fbfe898..5547aa7 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -33,17 +33,31 @@
/*
* The queue spinlock data structure - a 32-bit word
*
- * The bits assignment are:
+ * For NR_CPUS >= 16K, the bits assignment are:
* Bit 0 : Set if locked
* Bits 1-7 : Not used
* Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bits assignment are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-15: Reserved for architecture specific optimization
+ * Bits 16-31: Queue code
*/
typedef struct qspinlock {
atomic_t qlcode; /* Lock + queue code */
} arch_spinlock_t;

-#define _QCODE_OFFSET 8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _QCODE_LONG /* 24-bit queue code */
+# define _QCODE_OFFSET 8
+# define _QLOCK_LOCK_MASK 0xff
+#else
+# define _QCODE_SHORT /* 16-bit queue code */
+# define _QCODE_OFFSET 16
+# define _QLOCK_LOCK_MASK 0xffff
+#endif
+
#define _QLOCK_LOCKED 1U
-#define _QLOCK_LOCK_MASK 0xff

#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 92ed540..45c68a4 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -62,6 +62,10 @@
* Bits 0-1 : queue node index (4 nodes)
* Bits 2-23: CPU number + 1 (4M - 1 CPUs)
*
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1 (16K - 1 CPUs)
+ *
* A queue node code of 0 indicates that no one is waiting for the lock.
* As the value 0 cannot be used as a valid CPU number. We need to add
* 1 to it before putting it into the queue code.
@@ -104,6 +108,97 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };

/*
************************************************************************
+ * The following optimized codes are for architectures that support: *
+ * 1) Atomic byte and short data write *
+ * 2) Byte and short data exchange and compare-exchange instructions *
+ * *
+ * For those architectures, their asm/qspinlock.h header file should *
+ * define the followings in order to use the optimized codes. *
+ * 1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro *
+ * 2) A "union arch_qspinlock" structure that include the individual *
+ * fields of the qspinlock structure, including: *
+ * o slock - the qspinlock structure *
+ * o lock - the lock byte *
+ * o qcode - the queue node code *
+ * o qlcode - the 32-bit qspinlock word *
+ * *
+ ************************************************************************
+ */
+#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+#ifdef _QCODE_SHORT
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * architectures that allows atomic 8/16 bit operations:
+ * 1) The 16-bit queue code can be accessed or modified directly as a
+ * 16-bit short value without disturbing the first 2 bytes.
+ */
+#define queue_encode_qcode(cpu, idx) (((cpu) + 1) << 2 | (idx))
+
+#define queue_code_xchg queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code in the lock [OUT]
+ * @ncode: New queue code to be exchanged
+ * Return: NORMAL_EXIT is always returned
+ */
+static inline enum exitval
+queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ *ocode = xchg(&qlock->qcode, (u16)ncode);
+ return NORMAL_EXIT;
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ qcode <<= _QCODE_OFFSET;
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QLOCK_LOCKED) == qcode;
+}
+
+#define qsval_to_qcode qsval_to_qcode
+/**
+ * qsval_to_qcode - Convert a queue spinlock value to a queue code
+ * @qsval : Queue spinlock value
+ * Return : The corresponding queue code value
+ */
+static inline u32
+qsval_to_qcode(int qsval)
+{
+ return (u32)(qsval >> _QCODE_OFFSET);
+}
+#endif /* _QCODE_SHORT */
+
+#ifndef __queue_spin_trylock
+#define __queue_spin_trylock __queue_spin_trylock
+/**
+ * __queue_spin_trylock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ *
+ * This is an unfair version of the trylock which should only be called
+ * by a caller who is entitled to acquire the lock.
+ */
+static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ return cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0;
+}
+#endif
+#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
+
+/*
+ ************************************************************************
* Inline functions used by the queue_spin_lock_slowpath() function *
* that may get superseded by a more optimized version. *
************************************************************************
--
1.7.1

2014-04-02 13:28:36

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 08/10] pvqspinlock, x86: Add qspinlock para-virtualization support

This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halted itself. The queue head waiter, unlike the
other waiter, will spins 2*QSPIN_THRESHOLD times before halting
itself. Before being halted, the queue head waiter will set a flag
(_QLOCK_LOCKED_SLOWPATH) in the lock byte to indicate that the unlock
slowpath has to be invoked.

In the unlock slowpath, the current lock holder will find the queue
head by following the previous node pointer links stored in the queue
node structure until it finds one that has the qhead flag turned
on. It then attempt to kick the CPU of the queue head.

After the queue head acquired the lock, it will also check the status
of the next node and set _QLOCK_LOCKED_SLOWPATH if it has been halted.

Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
time (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

# of Ticket lock Queue lock
tasks PV off/PV on/%Change PV off/PV on/%Change
------ -------------------- ---------------------
1 135/ 179/+33% 137/ 169/+23%
2 1045/ 1103/ +6% 964/ 1137/+18%
3 1827/ 2683/+47% 2228/ 2537/+14%
4 2689/ 4191/+56% 2769/ 3097/+12%
5 3736/ 5830/+56% 3447/ 3568/ +4%
6 4942/ 7609/+54% 4169/ 4292/ +3%
7 6304/ 9570/+52% 4898/ 5021/ +3%
8 7736/11323/+46% 5620/ 5717/ +2%

The big reduction in performance with 2 contending tasks for the PV
queue spinlock is due to the switching off of the optimized code path
when PV spinlock code is turned on.

It can be seen that the ticket lock PV code has a fairly big decrease
in performance when there are 3 or more contending tasks. The queue
spinlock PV code, on the other hand, only has a relatively minor drop
in performance for 3 or more contending tasks. At 5 or more contending
tasks, there is practically no difference in performance. When coupled
with unfair lock, the queue spinlock can be much faster than the PV
ticket lock.

When both the unfair lock and PV spinlock features is turned on,
lock stealing will still be allowed in the fastpath, but not in
the slowpath.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/paravirt.h | 17 ++-
arch/x86/include/asm/paravirt_types.h | 16 ++
arch/x86/include/asm/pvqspinlock.h | 260 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/qspinlock.h | 35 +++++
arch/x86/kernel/paravirt-spinlocks.c | 6 +
kernel/locking/qspinlock.c | 138 +++++++++++++++++-
6 files changed, 465 insertions(+), 7 deletions(-)
create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..a35cd02 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -711,7 +711,22 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
}

#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#ifdef CONFIG_QUEUE_SPINLOCK
+static __always_inline void __queue_kick_cpu(int cpu)
+{
+ PVOP_VCALL1(pv_lock_ops.kick_cpu, cpu);
+}
+
+static __always_inline void __queue_hibernate(enum pv_lock_stats type)
+{
+ PVOP_VCALL1(pv_lock_ops.hibernate, type);
+}

+static __always_inline void __queue_lockstat(enum pv_lock_stats type)
+{
+ PVOP_VCALL1(pv_lock_ops.lockstat, type);
+}
+#else
static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
{
@@ -723,7 +738,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
{
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
}
-
+#endif
#endif

#ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..a8564b9 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,25 @@ struct arch_spinlock;
typedef u16 __ticket_t;
#endif

+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_lock_stats {
+ PV_HALT_QHEAD, /* Queue head halting */
+ PV_HALT_QNODE, /* Other queue node halting */
+ PV_WAKE_KICKED, /* Wakeup by kicking */
+ PV_WAKE_SPURIOUS, /* Spurious wakeup */
+ PV_KICK_NOHALT /* Kick but CPU not halted */
+};
+#endif
+
struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+ void (*kick_cpu)(int cpu);
+ void (*hibernate)(enum pv_lock_stats type);
+ void (*lockstat)(enum pv_lock_stats type);
+#else
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
};

/* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 0000000..a632dcb
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,260 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ * Queue Spinlock Para-Virtualization (PV) Support
+ *
+ * +------+ +-----+ next +----+
+ * | Lock | |Queue|----------->|Next|
+ * |Holder|<-----------|Head |<-----------|Node|
+ * +------+ prev_qcode +-----+ prev_qcode +----+
+ *
+ * The PV support code for queue spinlock is roughly the same as that
+ * of the ticket spinlock. Each CPU waiting for the lock will spin until it
+ * reaches a threshold. When that happens, it will put itself to halt so
+ * that the hypervisor can reuse the CPU cycles in some other guests.
+ *
+ * A major difference between the two versions of PV support is the fact
+ * that the queue head will spin twice as long as the other nodes before it
+ * puts itself to halt.
+ *
+ * There are 2 places where race can happen:
+ * 1) Halting of the queue head CPU (in pv_head_spin_check) and the CPU
+ * kicking by the lock holder (in pv_kick_node).
+ * 2) Halting of the queue node CPU (in pv_queue_spin_check) and the
+ * the status check by the previous queue head (in pv_next_node_check).
+ * See the comments on those functions to see how the races are being
+ * addressed.
+ */
+
+/*
+ * Spin threshold for queue spinlock
+ * This is half of the ticket lock's SPIN_THRESHOLD. The queue head will
+ * be halted after 2*QSPIN_THRESHOLD whereas the other nodes will be
+ * halted after QSPIN_THRESHOLD.
+ */
+#define QSPIN_THRESHOLD (1U<<14)
+
+/*
+ * CPU state flags
+ */
+#define PV_CPU_ACTIVE 1 /* This CPU is active */
+#define PV_CPU_KICKED 2 /* This CPU is being kicked */
+#define PV_CPU_HALTED -1 /* This CPU is halted */
+
+/*
+ * Additional fields to be added to the qnode structure
+ */
+#if CONFIG_NR_CPUS >= (1 << 16)
+#define _cpuid_t u32
+#else
+#define _cpuid_t u16
+#endif
+
+struct qnode;
+
+struct pv_qvars {
+ s8 cpustate; /* CPU status flag */
+ s8 qhead; /* Becoming queue head */
+ _cpuid_t mycpu; /* CPU number of this node */
+ struct qnode *prev; /* Pointer to previous node */
+};
+
+/*
+ * Macro to be used by the unfair lock code to access the previous node pointer
+ * in the pv structure.
+ */
+#define qprev pv.prev
+
+/**
+ * pv_init_vars - initialize fields in struct pv_qvars
+ * @pv : pointer to struct pv_qvars
+ * @cpu: current CPU number
+ */
+static __always_inline void pv_init_vars(struct pv_qvars *pv, int cpu)
+{
+ pv->cpustate = PV_CPU_ACTIVE;
+ pv->prev = NULL;
+ pv->qhead = false;
+ pv->mycpu = cpu;
+}
+
+/**
+ * pv_head_spin_check - perform para-virtualization checks for queue head
+ * @pv : pointer to struct pv_qvars
+ * @count : loop count
+ * @qcode : queue code of the supposed lock holder
+ * @lock : pointer to the qspinlock structure
+ *
+ * The following checks will be done:
+ * 2) Halt itself if lock is still not available after 2*QSPIN_THRESHOLD
+ */
+static __always_inline void pv_head_spin_check(struct pv_qvars *pv, int *count,
+ u32 qcode, struct qspinlock *lock)
+{
+ if (!static_key_false(&paravirt_spinlocks_enabled))
+ return;
+
+ if (unlikely(*count >= 2*QSPIN_THRESHOLD)) {
+ u8 lockval;
+
+ /*
+ * Set the lock byte to _QLOCK_LOCKED_SLOWPATH before
+ * trying to hibernate itself. It is possible that the
+ * lock byte had been set to _QLOCK_LOCKED_SLOWPATH
+ * already (spurious wakeup of queue head after a halt).
+ * In this case, just proceeds to sleeping.
+ *
+ * queue head lock holder
+ * ---------- -----------
+ * cpustate = PV_CPU_HALTED
+ * [1] cmpxchg(_QLOCK_LOCKED [2] cmpxchg(_QLOCK_LOCKED => 0)
+ * => _QLOCK_LOCKED_SLOWPATH) if (cmpxchg fails &&
+ * if (cmpxchg succeeds) cpustate == PV_CPU_HALTED)
+ * halt() kick()
+ *
+ * Sequence:
+ * 1,2 - slowpath flag set, queue head halted & lock holder
+ * will call slowpath
+ * 2,1 - queue head cmpxchg fails, halt is aborted
+ *
+ * If the queue head CPU is woken up by a spurious interrupt
+ * at the same time as the lock holder check the cpustate,
+ * it is possible that the lock holder will try to kick
+ * the queue head CPU which isn't halted.
+ */
+ ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
+ lockval = cmpxchg(&((union arch_qspinlock *)lock)->lock,
+ _QLOCK_LOCKED, _QLOCK_LOCKED_SLOWPATH);
+ if (lockval == 0) {
+ /*
+ * Can exit now as the lock is free
+ */
+ ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+ *count = 0;
+ return;
+ }
+ __queue_hibernate(PV_HALT_QHEAD);
+ __queue_lockstat((pv->cpustate == PV_CPU_KICKED)
+ ? PV_WAKE_KICKED : PV_WAKE_SPURIOUS);
+ ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+ *count = 0; /* Reset count */
+ }
+}
+
+/**
+ * pv_queue_spin_check - perform para-virtualization checks for queue member
+ * @pv : pointer to struct pv_qvars
+ * @count: loop count
+ */
+static __always_inline void pv_queue_spin_check(struct pv_qvars *pv, int *count)
+{
+ if (!static_key_false(&paravirt_spinlocks_enabled))
+ return;
+ /*
+ * Attempt to halt oneself after QSPIN_THRESHOLD spins
+ */
+ if (unlikely(*count >= QSPIN_THRESHOLD)) {
+ /*
+ * Time to hibernate itself
+ */
+ ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
+ /*
+ * In order to avoid the racing between pv_next_node_check()
+ * and pv_queue_spin_check(), 2 variables handshake is used
+ * to make sure that pv_next_node_check() won't miss setting
+ * the _QLOCK_LOCKED_SLOWPATH when the CPU is about to be
+ * halted.
+ *
+ * pv_next_node_check pv_queue_spin_check
+ * ------------------ -------------------
+ * [1] qhead = true [3] cpustate = PV_CPU_HALTED
+ * barrier() barrier()
+ * [2] if (cpustate [4] if (qhead)
+ * == PV_CPU_HALTED)
+ *
+ * Sequence:
+ * *,1,*,4,* - halt is aborted as the qhead flag is set,
+ * _QLOCK_LOCKED_SLOWPATH may or may not be set
+ * 3,4,1,2 - the CPU is halt and _QLOCK_LOCKED_SLOWPATH is set
+ */
+ barrier();
+ if (!ACCESS_ONCE(pv->qhead)) {
+ __queue_hibernate(PV_HALT_QNODE);
+ __queue_lockstat((pv->cpustate == PV_CPU_KICKED)
+ ? PV_WAKE_KICKED : PV_WAKE_SPURIOUS);
+ } else {
+ pv->qhead = false;
+ }
+ ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+ *count = 0; /* Reset count */
+ }
+}
+
+/**
+ * pv_next_node_check - set _QLOCK_LOCKED_SLOWPATH flag if the next node
+ * is halted
+ * @pv : pointer to struct pv_qvars
+ * @count: loop count
+ *
+ * The current CPU should have gotten the lock before calling this function.
+ */
+static __always_inline void
+pv_next_node_check(struct pv_qvars *pv, struct qspinlock *lock)
+{
+ if (!static_key_false(&paravirt_spinlocks_enabled))
+ return;
+ pv->qhead = true;
+ /*
+ * Make sure qhead flag is visible before checking the cpustate flag
+ */
+ barrier();
+ if (ACCESS_ONCE(pv->cpustate) == PV_CPU_HALTED)
+ ACCESS_ONCE(((union arch_qspinlock *)lock)->lock)
+ = _QLOCK_LOCKED_SLOWPATH;
+}
+
+/**
+ * pv_set_prev - set previous queue node pointer
+ * @pv : pointer to struct pv_qvars to be set
+ * @prev: pointer to the previous node
+ */
+static __always_inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev)
+{
+ ACCESS_ONCE(pv->prev) = prev;
+ /*
+ * Make sure the prev field is set up before others
+ */
+ smp_wmb();
+}
+
+/*
+ * The following inlined functions are being used by the
+ * queue_spin_unlock_slowpath() function.
+ */
+
+/**
+ * pv_get_prev - get previous queue node pointer
+ * @pv : pointer to struct pv_qvars to be set
+ * Return: the previous queue node pointer
+ */
+static __always_inline struct qnode *pv_get_prev(struct pv_qvars *pv)
+{
+ return ACCESS_ONCE(pv->prev);
+}
+
+/**
+ * pv_kick_node - kick up the CPU of the given node
+ * @pv : pointer to struct pv_qvars of the node to be kicked
+ */
+static __always_inline void pv_kick_node(struct pv_qvars *pv)
+{
+ if (pv->cpustate != PV_CPU_HALTED) {
+ __queue_lockstat(PV_KICK_NOHALT);
+ return;
+ }
+ ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
+ __queue_kick_cpu(pv->mycpu);
+}
+
+#endif /* _ASM_X86_PVQSPINLOCK_H */
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index d91994d..98692cf 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -42,7 +42,11 @@ extern struct static_key paravirt_unfairlocks_enabled;
* that the clearing the lock bit is done ASAP without artificial delay
* due to compiler optimization.
*/
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+static __always_inline void __queue_spin_unlock(struct qspinlock *lock)
+#else
static inline void queue_spin_unlock(struct qspinlock *lock)
+#endif
{
union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

@@ -51,6 +55,37 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
barrier();
}

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * The lock byte can have a value of _QLOCK_LOCKED_SLOWPATH to indicate
+ * that it needs to go through the slowpath to do the unlocking.
+ */
+#define _QLOCK_LOCKED_SLOWPATH 3 /* Set both bits 0 & 1 */
+
+extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
+
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ barrier();
+ if (static_key_false(&paravirt_spinlocks_enabled)) {
+ /*
+ * Need to atomically clear the lock byte to avoid racing with
+ * queue head waiter trying to set _QLOCK_LOCKED_SLOWPATH.
+ */
+ if (likely(cmpxchg(&qlock->lock, _QLOCK_LOCKED, 0)
+ == _QLOCK_LOCKED))
+ return;
+ else
+ queue_spin_unlock_slowpath(lock);
+
+ } else {
+ __queue_spin_unlock(lock);
+ }
+}
+#endif
+
#ifdef _QCODE_SHORT
#define __queue_spin_trylock __queue_spin_trylock
/**
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 6d36731..9379417 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -11,9 +11,15 @@
#ifdef CONFIG_PARAVIRT_SPINLOCKS
struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP
+#ifdef CONFIG_QUEUE_SPINLOCK
+ .kick_cpu = paravirt_nop,
+ .hibernate = paravirt_nop,
+ .lockstat = paravirt_nop,
+#else
.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
.unlock_kick = paravirt_nop,
#endif
+#endif
};
EXPORT_SYMBOL(pv_lock_ops);

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 527efc3..3448010 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -58,6 +58,26 @@
*/

/*
+ * Para-virtualized queue spinlock support
+ */
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#include <asm/pvqspinlock.h>
+#else
+
+struct qnode;
+struct pv_qvars {};
+static inline void pv_init_vars(struct pv_qvars *pv, int cpu_nr) {}
+static inline void pv_head_spin_check(struct pv_qvars *pv, int *count,
+ u32 qcode, struct qspinlock *lock) {}
+static inline void pv_queue_spin_check(struct pv_qvars *pv, int *count) {}
+static inline void pv_next_node_check(struct pv_qvars *pv, void *lock) {}
+static inline void pv_kick_node(struct pv_qvars *pv) {}
+static inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev) {}
+static inline struct qnode *pv_get_prev(struct pv_qvars *pv)
+{ return NULL; }
+#endif
+
+/*
* The 24-bit queue node code is divided into the following 2 fields:
* Bits 0-1 : queue node index (4 nodes)
* Bits 2-23: CPU number + 1 (4M - 1 CPUs)
@@ -86,14 +106,20 @@ enum exitval {

/*
* The queue node structure
+ *
+ * If CONFIG_PARAVIRT_SPINLOCKS is turned on, the previous node pointer in
+ * the pv structure will be used by the unfair lock code.
*/
struct qnode {
u32 qhead; /* Queue head flag */
#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
int lsteal_mask; /* Lock stealing frequency mask */
u32 prev_qcode; /* Queue code of previous node */
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
struct qnode *qprev; /* Previous queue node addr */
#endif
+#endif
+ struct pv_qvars pv; /* Para-virtualization */
struct qnode *next; /* Next queue node addr */
};

@@ -103,6 +129,20 @@ struct qnode_set {
};

/*
+ * Allow spinning loop count only if either PV spinlock or unfair lock is
+ * configured.
+ */
+#if defined(CONFIG_PARAVIRT_UNFAIR_LOCKS) || defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define DEF_LOOP_CNT(c) int c = 0
+#define INC_LOOP_CNT(c) (c)++
+#define LOOP_CNT(c) c
+#else
+#define DEF_LOOP_CNT(c)
+#define INC_LOOP_CNT(c)
+#define LOOP_CNT(c) 0
+#endif
+
+/*
* Per-CPU queue node structures
*/
static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
@@ -190,6 +230,16 @@ static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
int wset = false; /* True if wait bit was set */

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ /*
+ * Disable the quick spinning code path if PV spinlock is enabled to
+ * make sure that all the spinning CPUs can be halted when the lock
+ * holder is scheduled out.
+ */
+ if (static_key_false(&paravirt_spinlocks_enabled))
+ return 0;
+#endif
+
/*
* Fall into the quick spinning code path only if no task is waiting
* in the queue.
@@ -526,9 +576,6 @@ cmpxchg_queue_code(struct qspinlock *lock, u32 ocode, u32 ncode)
* starvation.
*/
#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
-#define DEF_LOOP_CNT(c) int c = 0
-#define INC_LOOP_CNT(c) (c)++
-#define LOOP_CNT(c) c
#define LSTEAL_MIN (1 << 3)
#define LSTEAL_MAX (1 << 10)
#define LSTEAL_MIN_MASK (LSTEAL_MIN - 1)
@@ -554,6 +601,14 @@ static void unfair_init_vars(struct qnode *node)
static void
unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_qcode)
{
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ /*
+ * Disable waiter lock stealing if PV spinlock is enabled
+ */
+ if (static_key_false(&paravirt_spinlocks_enabled))
+ return;
+#endif
+
if (!static_key_false(&paravirt_unfairlocks_enabled))
return;

@@ -580,6 +635,14 @@ unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_qcode)
*/
static enum exitval unfair_check_qcode(struct qspinlock *lock, u32 my_qcode)
{
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ /*
+ * Disable waiter lock stealing if PV spinlock is enabled
+ */
+ if (static_key_false(&paravirt_spinlocks_enabled))
+ return NOTIFY_NEXT;
+#endif
+
if (!static_key_false(&paravirt_unfairlocks_enabled))
return NOTIFY_NEXT;

@@ -607,6 +670,14 @@ static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node,
int qhead;
struct qnode *next;

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ /*
+ * Disable waiter lock stealing if PV spinlock is enabled
+ */
+ if (static_key_false(&paravirt_spinlocks_enabled))
+ return NORMAL_EXIT;
+#endif
+
if (!static_key_false(&paravirt_unfairlocks_enabled) ||
((count & node->lsteal_mask) != node->lsteal_mask))
return NORMAL_EXIT;
@@ -675,9 +746,6 @@ static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node,
}

#else /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
-#define DEF_LOOP_CNT(c)
-#define INC_LOOP_CNT(c)
-#define LOOP_CNT(c) 0

static void unfair_init_vars(struct qnode *node) {}
static void unfair_set_vars(struct qnode *node, struct qnode *prev,
@@ -748,6 +816,7 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
struct qnode *next;
u32 prev_qcode;
enum exitval exitval;
+ DEF_LOOP_CNT(hcnt);

/*
* Exchange current copy of the queue node code
@@ -767,6 +836,7 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
DEF_LOOP_CNT(cnt);

unfair_set_vars(node, prev, prev_qcode);
+ pv_set_prev(&node->pv, prev);
ACCESS_ONCE(prev->next) = node;
/*
* Wait until the queue head flag is on
@@ -780,13 +850,17 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
goto release_node;
else if (unlikely(exitval == NOTIFY_NEXT))
goto notify_next;
+ pv_queue_spin_check(&node->pv, LOOP_CNT(&cnt));
} while (!ACCESS_ONCE(node->qhead));
+ } else {
+ ACCESS_ONCE(node->qhead) = true;
}

/*
* At the head of the wait queue now
*/
for (;; arch_mutex_cpu_relax()) {
+ INC_LOOP_CNT(hcnt);
qsval = atomic_read(&lock->qlcode);
if (qsval & _QLOCK_LOCK_MASK)
continue; /* Lock not available yet */
@@ -820,6 +894,12 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
} else if (queue_spin_trylock_and_clr_qcode(lock, my_qcode)) {
goto release_node;
}
+
+ /*
+ * Perform para-virtualization checks
+ */
+ pv_head_spin_check(&node->pv, LOOP_CNT(&hcnt), prev_qcode,
+ lock);
}

notify_next:
@@ -832,6 +912,7 @@ set_qhead:
/*
* The next one in queue is now at the head
*/
+ pv_next_node_check(&next->pv, lock);
ACCESS_ONCE(next->qhead) = true;

release_node:
@@ -871,6 +952,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
node->qhead = false;
node->next = NULL;
unfair_init_vars(node);
+ pv_init_vars(&node->pv, cpu_nr);

/*
* The lock may be available at this point, try again if no task was
@@ -882,3 +964,47 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
queue_spin_lock_slowerpath(lock, node, my_qcode);
}
EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/**
+ * queue_spin_unlock_slowpath - kick up the CPU of the queue head
+ * @lock : Pointer to queue spinlock structure
+ *
+ * The lock is released after finding the queue head to avoid racing
+ * condition between the queue head and the lock holder.
+ */
+void queue_spin_unlock_slowpath(struct qspinlock *lock)
+{
+ struct qnode *node, *prev;
+ u32 qcode = queue_get_qcode(lock);
+
+ /*
+ * Get the queue tail node
+ */
+ node = xlate_qcode(qcode);
+
+ /*
+ * Locate the queue head node by following the prev pointer from
+ * tail to head.
+ * It is assumed that the PV guests won't have that many CPUs so
+ * that it won't take a long time to follow the pointers.
+ */
+ while (!ACCESS_ONCE(node->qhead)) {
+ prev = pv_get_prev(&node->pv);
+ if (prev)
+ node = prev;
+ else
+ /*
+ * Delay a bit to allow the prev pointer to be set up
+ */
+ arch_mutex_cpu_relax();
+ }
+ /*
+ * Found the queue head, now release the lock before waking it up
+ */
+ __queue_spin_unlock(lock);
+ pv_kick_node(&node->pv);
+}
+EXPORT_SYMBOL(queue_spin_unlock_slowpath);
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
--
1.7.1

2014-04-02 13:28:58

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 07/10] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled

This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/spinlock.h | 4 ++--
arch/x86/kernel/kvm.c | 2 +-
arch/x86/kernel/paravirt-spinlocks.c | 4 ++--
arch/x86/xen/spinlock.c | 2 +-
4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 958d20f..428d0d1 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,7 +39,7 @@
/* How long a lock should spin before we consider blocking */
#define SPIN_THRESHOLD (1 << 15)

-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

#ifdef CONFIG_QUEUE_SPINLOCK
@@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
if (TICKET_SLOWPATH_FLAG &&
- static_key_false(&paravirt_ticketlocks_enabled)) {
+ static_key_false(&paravirt_spinlocks_enabled)) {
arch_spinlock_t prev;

prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 713f1b3..8e646a7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -818,7 +818,7 @@ static __init int kvm_spinlock_init_jump(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return 0;

- static_key_slow_inc(&paravirt_ticketlocks_enabled);
+ static_key_slow_inc(&paravirt_spinlocks_enabled);
printk(KERN_INFO "KVM setup paravirtual spinlock\n");

return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 7dfd02d..6d36731 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = {
};
EXPORT_SYMBOL(pv_lock_ops);

-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
#endif

#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 581521c..06f4a64 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -290,7 +290,7 @@ static __init int xen_init_spinlocks_jump(void)
if (!xen_pvspin)
return 0;

- static_key_slow_inc(&paravirt_ticketlocks_enabled);
+ static_key_slow_inc(&paravirt_spinlocks_enabled);
return 0;
}
early_initcall(xen_init_spinlocks_jump);
--
1.7.1

2014-04-02 13:28:44

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 04/10] qspinlock: Optimized code path for 2 contending tasks

A major problem with the queue spinlock patch is its performance at
low contention level (2-4 contending tasks) where it is slower than
the corresponding ticket spinlock code. The following table shows
the execution time (in ms) of a micro-benchmark where 5M iterations
of the lock/unlock cycles were run on a 10-core Westere-EX x86-64
CPU with 2 different types of loads - standalone (lock and protected
data in different cachelines) and embedded (lock and protected data
in the same cacheline).

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 135/111 135/101 0%/-9%
2 1045/950 1884/1853 +80%/+95%
3 1827/1783 2256/2264 +23%/+27%
4 2689/2725 2880/2884 +7%/+6%
5 3736/3748 3636/3617 -3%/-3%
6 4942/4984 4294/4253 -13%/-15%
7 6304/6319 4976/4958 -21%/-22%
8 7736/7629 5662/5651 -27%/-26%

It can be seen that the performance degradation is particular bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention level, a special specific optimized code path
for 2 contending tasks was added. This special code path can only be
activated with less than 16K of configured CPUs because it uses a byte
in the 32-bit lock word to hold a waiting bit for the 2nd contending
tasks instead of queuing the waiting task in the queue.

With the change, the performance data became:

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
2 1045/950 951/955 -9%/+1%

In a multi-socketed server, the optimized code path also seems to
produce a pretty good performance improvement in cross-node contention
traffic at low contention level. The table below show the performance
with 1 contending task per node:

[Standalone]
# of nodes Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 135 135 0%
2 4452 1024 -77%
3 10767 14030 +30%
4 20835 10740 -48%

Except some drop in performance at the 3 contending tasks level,
the queue spinlock performs much better than the ticket spinlock at
2 and 4 contending tasks level.

With IvyBridge-EX (2.8 GHz), the performance profile of qspinlock vs
ticket spinlock changes quite a bit due to the fact that the pause
instruction in IvyBridge-EX is about 3 times as long as that in
Westmere-EX (25 cycles vs. 8 cycles according to my measurement). So
spinning on the lock word by the ticket spinlock is less problematic
in IvyBridge-EX.

The table below shows the results of the same low-level spinlock test
run on one socket of IvyBridge-EX (15 cores, 1M iterations instead
of 5M):

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 59/48 56/42 -5%/-13%
2 514/487 289/345 -44%/-29%
3 812/768 1048/1053 +29%/+37%
4 1136/1077 1219/1220 +7%/+13%
5 1464/1398 1560/1581 +7%/+13%
6 1806/1787 1952/1959 +8%/+10%
7 2185/2204 2360/2377 +8%/ +8%
8 2582/2656 2759/2747 +7%/ +3%
9 2979/3131 3120/3103 +5%/ -1%
10 3398/3602 3484/3498 +3%/ -3%
11 3848/4110 3807/3829 -1%/ -7%
12 4326/4655 4132/4117 -4%/-12%

The queue spinlock is still faster than the ticket spinlock with 1 or
2 contending tasks (light spinlock contention). After that, ticket
spinlock is faster (moderate spinlock contention) until there are
11 or more contending tasks for a standalone spinlock or 9 or more
contending tasks for an embedded spinlocks.

The table below shows the performance profile when there is one
contending task are from each of the different nodes in an 8-node
prototype IvyBridge-EX machine:

[Standalone]
# of nodes Ticket lock Queue lock %Change
---------- ----------- ---------- -------
2 532 503 -5%
3 2449 3812 +56%
4 6211 4571 -26%
5 8606 6104 -29%
6 9011 7641 -15%
7 12907 8373 -35%
8 15094 10259 -32%

There is some performance drop at the 3 contending tasks level.
Other than that, queue spinlock is faster than ticket spinlock.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock.h | 3 +-
kernel/locking/qspinlock.c | 212 +++++++++++++++++++++++++++++++++-----
2 files changed, 187 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index f058b91..265b10b 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -21,9 +21,10 @@ union arch_qspinlock {
struct qspinlock slock;
struct {
u8 lock; /* Lock bit */
- u8 reserved;
+ u8 wait; /* Waiting bit */
u16 qcode; /* Queue code */
};
+ u16 lock_wait; /* Lock and wait bits */
u32 qlcode; /* Complete lock word */
};

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 45c68a4..cf16bba 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -121,6 +121,8 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
* o lock - the lock byte *
* o qcode - the queue node code *
* o qlcode - the 32-bit qspinlock word *
+ * o wait - the waiting byte *
+ * o lock_wait - the combined lock and waiting bytes *
* *
************************************************************************
*/
@@ -131,9 +133,135 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
* architectures that allows atomic 8/16 bit operations:
* 1) The 16-bit queue code can be accessed or modified directly as a
* 16-bit short value without disturbing the first 2 bytes.
+ * 2) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ * for waiting lock acquirer so that it won't need to go through the
+ * MCS style locking queuing which has a higher overhead.
*/
+
+/*
+ * Masks for the lock and wait bits
+ */
+#define _QLOCK_WAIT_SHIFT 8 /* Waiting bit position */
+#define _QLOCK_WAITING (1 << _QLOCK_WAIT_SHIFT)
+#define _QLOCK_LW_MASK (_QLOCK_WAITING | _QLOCK_LOCKED)
+
#define queue_encode_qcode(cpu, idx) (((cpu) + 1) << 2 | (idx))

+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - quick spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue.
+ *
+ * The lock and wait bits can be in one of following 4 states:
+ *
+ * State lock wait
+ * ----- ---------
+ * [0] 0 0 Lock is free and no one is waiting
+ * [1] 1 0 Lock is not available, but no one is waiting
+ * [2] 0 1 Lock is free, but a waiter is present
+ * [3] 1 1 Lock is not available and a waiter is present
+ *
+ * A task entering the quick path will set the wait bit and be in either
+ * either states 2 or 3. The allowable transitions are:
+ *
+ * [3] => [2] => [1] => [0]
+ * ^ |
+ * +-------------+
+ *
+ * N.B. The wait bit won't be set if the queue code has been set before.
+ * As a result, the queue head can safely get the lock without using
+ * atomic operation as long as it checks that the wait bit hasn't been
+ * set. The cpu_relax() function is used after atomic operation whereas
+ * arch_mutex_cpu_relax() is used after a read.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+ int wset = false; /* True if wait bit was set */
+
+ /*
+ * Fall into the quick spinning code path only if no task is waiting
+ * in the queue.
+ */
+ for (; likely(!(qsval >> _QCODE_OFFSET));
+ qsval = atomic_read(&lock->qlcode)) {
+
+ if (unlikely((qsval & _QLOCK_LW_MASK) == _QLOCK_LW_MASK)) {
+ /*
+ * Wait a while to see if either lock or wait bit
+ * is cleared. Leave if the condition isn't changed.
+ */
+ cpu_relax();
+ cpu_relax();
+ qsval = atomic_read(&lock->qlcode);
+ if ((qsval >> _QCODE_OFFSET) ||
+ ((qsval & _QLOCK_LW_MASK) == _QLOCK_LW_MASK))
+ return 0;
+ }
+ if (unlikely(qsval == 0)) {
+ /*
+ * Attempt to acquire the lock directly here
+ */
+ qsval = atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED);
+ if (qsval == 0)
+ return 1; /* Got the lock */
+ cpu_relax();
+ continue;
+ }
+ if (unlikely(qsval & _QLOCK_WAITING)) {
+ arch_mutex_cpu_relax();
+ wset = true;
+ continue;
+ }
+
+ /*
+ * If the wait bit has just been cleared, the new lock holder
+ * should be busy in the critical section. It was found that
+ * waiting a bit longer before the exchange operation helps
+ * performance.
+ */
+ if (unlikely(wset))
+ arch_mutex_cpu_relax();
+
+ if (atomic_cmpxchg(&lock->qlcode, qsval, qsval|_QLOCK_WAITING)
+ != qsval) {
+ /*
+ * Another task has got the wait bit before us or
+ * the queue code has been set. Wait a bit and try
+ * again.
+ */
+ cpu_relax();
+ wset = false;
+ continue;
+ }
+
+ /*
+ * Wait a bit here if the lock bit was set.
+ */
+ if (unlikely(qsval & _QLOCK_LOCKED))
+ arch_mutex_cpu_relax();
+
+ /*
+ * Now wait until the lock bit is cleared
+ */
+ while (smp_load_acquire(&qlock->qlcode) & _QLOCK_LOCKED)
+ arch_mutex_cpu_relax();
+
+ /*
+ * Set the lock bit & clear the waiting bit simultaneously
+ * No lock stealing is allowed when this quick path is active.
+ */
+ ACCESS_ONCE(qlock->lock_wait) = _QLOCK_LOCKED;
+ return 1;
+ }
+ return 0;
+}
+
#define queue_code_xchg queue_code_xchg
/**
* queue_code_xchg - exchange a queue code value
@@ -192,7 +320,7 @@ static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
{
union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

- return cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0;
+ return cmpxchg(&qlock->lock_wait, 0, _QLOCK_LOCKED) == 0;
}
#endif
#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
@@ -203,6 +331,10 @@ static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
* that may get superseded by a more optimized version. *
************************************************************************
*/
+#ifndef queue_spin_trylock_quick
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{ return 0; }
+#endif

#ifndef __queue_spin_trylock
/**
@@ -365,37 +497,20 @@ static inline void put_qnode(void)
}

/**
- * queue_spin_lock_slowpath - acquire the queue spinlock
- * @lock : Pointer to queue spinlock structure
- * @qsval: Current value of the queue spinlock 32-bit word
+ * queue_spin_lock_slowerpath - put lock waiter in queue & wait for its turn
+ * @lock : Pointer to queue spinlock structure
+ * @node : Pointer to the queue node
+ * @my_qcode: Queue code for the queue node
*/
-void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
+ struct qnode *node, u32 my_qcode)
{
- unsigned int cpu_nr;
- struct qnode *node, *next;
- u32 prev_qcode, my_qcode;
+ int qsval;
+ struct qnode *next;
+ u32 prev_qcode;
enum exitval exitval;

/*
- * Get the queue node
- */
- cpu_nr = smp_processor_id();
- node = get_qnode(cpu_nr, &my_qcode);
-
- /*
- * Initialize the queue node
- */
- node->qhead = false;
- node->next = NULL;
-
- /*
- * The lock may be available at this point, try again if no task was
- * waiting in the queue.
- */
- if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock))
- goto release_node;
-
- /*
* Exchange current copy of the queue node code
*/
exitval = queue_code_xchg(lock, &prev_qcode, my_qcode);
@@ -463,4 +578,47 @@ set_qhead:
release_node:
put_qnode();
}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ *
+ * The slowpath was broken into a regular slowpath and a slowerpath. The
+ * slowpath contains the quick spinning code and the slower path contains
+ * the regular lock waiter queuing code. This is to avoid the complexity in
+ * the queuing code from slowing down the quick spinning path due to
+ * compiler optimization.
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+ struct qnode *node;
+ u32 my_qcode, cpu_nr;
+
+ /*
+ * Try the quick spinning code path
+ */
+ if (queue_spin_trylock_quick(lock, qsval))
+ return;
+ /*
+ * Get the queue node
+ */
+ cpu_nr = smp_processor_id();
+ node = get_qnode(cpu_nr, &my_qcode);
+
+ /*
+ * Initialize the queue node
+ */
+ node->qhead = false;
+ node->next = NULL;
+
+ /*
+ * The lock may be available at this point, try again if no task was
+ * waiting in the queue.
+ */
+ if (likely(!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)))
+ put_qnode(); /* Return the queue node */
+ else
+ queue_spin_lock_slowerpath(lock, node, my_qcode);
+}
EXPORT_SYMBOL(queue_spin_lock_slowpath);
--
1.7.1

2014-04-02 13:29:17

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suit to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, interrupt handler can try
to acquire one spinlock while the interrupted user process is in the
process of getting another spinlock. By allocating a set of per-cpu
queue nodes and used them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the 1-byte lock bit, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7 : the locked byte
Bits 8-9 : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4Ghz Westmere x86-64 CPU. The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

Lock Type Time (ns)
--------- ---------
Ticket spinlock 14.1
Queue spinlock 8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that ticket spinlock does a read-modify-write (add)
instruction in unlock whereas queue spinlock only does a simple write
in unlock which can be much faster in a pipelined CPU.

The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere
x86-64 CPUs with XFS filesystem on a ramdisk and HT off to evaluate
the performance impact of this patch on a 3.13 kernel.

+------------+----------+-----------------+---------+
| Kernel | 3.13 JPM | 3.13 with | %Change |
| | | qspinlock patch | |
+------------+----------+-----------------+---------+
| 10-100 users |
+------------+----------+-----------------+---------+
|custom | 357459 | 363109 | +1.58% |
|dbase | 496847 | 498801 | +0.39% |
|disk | 2925312 | 2771387 | -5.26% |
|five_sec | 166612 | 169215 | +1.56% |
|fserver | 382129 | 383279 | +0.30% |
|high_systime| 16356 | 16380 | +0.15% |
|short | 4521978 | 4257363 | -5.85% |
+------------+----------+-----------------+---------+
| 200-1000 users |
+------------+----------+-----------------+---------+
|custom | 449070 | 447711 | -0.30% |
|dbase | 845029 | 853362 | +0.99% |
|disk | 2725249 | 4892907 | +79.54% |
|five_sec | 169410 | 170638 | +0.72% |
|fserver | 489662 | 491828 | +0.44% |
|high_systime| 142823 | 143790 | +0.68% |
|short | 7435288 | 9016171 | +21.26% |
+------------+----------+-----------------+---------+
| 1100-2000 users |
+------------+----------+-----------------+---------+
|custom | 432470 | 432570 | +0.02% |
|dbase | 889289 | 890026 | +0.08% |
|disk | 2565138 | 5008732 | +95.26% |
|five_sec | 169141 | 170034 | +0.53% |
|fserver | 498569 | 500701 | +0.43% |
|high_systime| 229913 | 245866 | +6.94% |
|short | 8496794 | 8281918 | -2.53% |
+------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

26.19% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--47.28%-- evict
|--46.87%-- inode_sb_list_add
|--1.24%-- xlog_cil_insert_items
|--0.68%-- __remove_inode_hash
|--0.67%-- inode_wait_for_writeback
--3.26%-- [...]
22.96% swapper [kernel.kallsyms] [k] cpu_idle_loop
5.56% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
4.87% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
2.04% reaim [kernel.kallsyms] [k] mspin_lock
1.30% reaim [kernel.kallsyms] [k] memcpy
1.08% reaim [unknown] [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

26.82% swapper [kernel.kallsyms] [k] cpu_idle_loop
4.66% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
3.97% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
2.40% reaim [kernel.kallsyms] [k] queue_spin_lock_slowpath
|--88.31%-- _raw_spin_lock
| |--36.02%-- inode_sb_list_add
| |--35.09%-- evict
| |--16.89%-- xlog_cil_insert_items
| |--6.30%-- try_to_wake_up
| |--2.20%-- _xfs_buf_find
| |--0.75%-- __remove_inode_hash
| |--0.72%-- __mutex_lock_slowpath
| |--0.53%-- load_balance
|--6.02%-- _raw_spin_lock_irqsave
| |--74.75%-- down_trylock
| |--9.69%-- rcu_check_quiescent_state
| |--7.47%-- down
| |--3.57%-- up
| |--1.67%-- rwsem_wake
| |--1.00%-- remove_wait_queue
| |--0.56%-- pagevec_lru_move_fn
|--5.39%-- _raw_spin_lock_irq
| |--82.05%-- rwsem_down_read_failed
| |--10.48%-- rwsem_down_write_failed
| |--4.24%-- __down
| |--2.74%-- __schedule
--0.28%-- [...]
2.20% reaim [kernel.kallsyms] [k] memcpy
1.84% reaim [unknown] [.] 0x000000000041517b
1.77% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--21.08%-- xlog_cil_insert_items
|--10.14%-- xfs_icsb_modify_counters
|--7.20%-- xfs_iget_cache_hit
|--6.56%-- inode_sb_list_add
|--5.49%-- _xfs_buf_find
|--5.25%-- evict
|--5.03%-- __remove_inode_hash
|--4.64%-- __mutex_lock_slowpath
|--3.78%-- selinux_inode_free_security
|--2.95%-- xfs_inode_is_filestream
|--2.35%-- try_to_wake_up
|--2.07%-- xfs_inode_set_reclaim_tag
|--1.52%-- list_lru_add
|--1.16%-- xfs_inode_clear_eofblocks_tag
:
1.30% reaim [kernel.kallsyms] [k] effective_load
1.27% reaim [kernel.kallsyms] [k] mspin_lock
1.10% reaim [kernel.kallsyms] [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/asm-generic/qspinlock.h | 122 +++++++++++
include/asm-generic/qspinlock_types.h | 49 +++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qspinlock.c | 371 +++++++++++++++++++++++++++++++++
5 files changed, 550 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..8525931
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return atomic_read(&lock->qlcode) & _QLOCK_LOCK_MASK;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !(atomic_read(&lock.qlcode) & _QLOCK_LOCK_MASK);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return atomic_read(&lock->qlcode) & ~_QLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->qlcode) &&
+ (atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ int qsval;
+
+ /*
+ * To reduce memory access to only once for the cold cache case,
+ * a direct cmpxchg() is performed in the fastpath to optimize the
+ * uncontended case. The contended performance, however, may suffer
+ * a bit because of that.
+ */
+ qsval = atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED);
+ if (likely(qsval == 0))
+ return;
+ queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * Use an atomic subtraction to clear the lock bit.
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializier
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..fbfe898
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,49 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file incluson via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * The bits assignment are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+ atomic_t qlcode; /* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET 8
+#define _QLOCK_LOCKED 1U
+#define _QLOCK_LOCK_MASK 0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..92ed540
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,371 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1 (4M - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number. We need to add
+ * 1 to it before putting it into the queue code.
+ */
+#define MAX_QNODES 4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET _QCODE_OFFSET
+#endif
+
+/*
+ * Function exit status
+ */
+enum exitval {
+ NORMAL_EXIT = 0,
+ NOTIFY_NEXT , /* Notify the next waiting node CPU */
+ RELEASE_NODE /* Release current node directly */
+};
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. It is retained for future extension where new
+ * fields may be added.
+ */
+struct qnode {
+ u32 qhead; /* Queue head flag */
+ struct qnode *next; /* Next queue node addr */
+};
+
+struct qnode_set {
+ struct qnode nodes[MAX_QNODES];
+ int node_idx; /* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function *
+ * that may get superseded by a more optimized version. *
+ ************************************************************************
+ */
+
+#ifndef __queue_spin_trylock
+/**
+ * __queue_spin_trylock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ *
+ * This is an unfair version of the trylock which should only be called
+ * by a caller who is entitled to acquire the lock.
+ */
+static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
+{
+ int qlcode = atomic_read(&lock->qlcode);
+
+ if (!(qlcode & _QLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+ qlcode, qlcode|_QLOCK_LOCKED) == qlcode))
+ return 1;
+ return 0;
+}
+#endif /* __queue_spin_trylock */
+
+#ifndef qsval_to_qcode
+/**
+ * qsval_to_qcode - Convert a queue spinlock value to a queue code
+ * @qsval : Queue spinlock value
+ * Return : The corresponding queue code value
+ */
+static inline u32
+qsval_to_qcode(int qsval)
+{
+ return (u32)(qsval & ~_QLOCK_LOCK_MASK);
+}
+#endif /* qsval_to_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ */
+static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+ return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+ (qn_idx << _QCODE_VAL_OFFSET);
+}
+#endif /* queue_encode_qcode */
+
+#ifndef queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code in the lock [OUT]
+ * @ncode: New queue code to be exchanged
+ * Return: An enum exitval value
+ */
+static inline enum exitval
+queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
+{
+ ncode |= _QLOCK_LOCKED; /* Set lock bit */
+
+ /*
+ * Exchange current copy of the queue node code
+ */
+ *ocode = atomic_xchg(&lock->qlcode, ncode);
+
+ if (likely(*ocode & _QLOCK_LOCKED)) {
+ *ocode &= ~_QLOCK_LOCKED; /* Clear the lock bit */
+ return NORMAL_EXIT;
+ }
+ /*
+ * It is possible that we may accidentally steal the lock during
+ * the unlock-lock transition. If this is the case, we need to either
+ * release it if not the head of the queue or get the lock and be
+ * done with it.
+ */
+ if (*ocode == 0) {
+ u32 qcode;
+
+ /*
+ * Got the lock since it is at the head of the queue
+ * Now try to atomically clear the queue code.
+ */
+ qcode = atomic_cmpxchg(&lock->qlcode, ncode, _QLOCK_LOCKED);
+ /*
+ * The cmpxchg fails only if one or more tasks are added to
+ * the queue. In this case, NOTIFY_NEXT is returned instead
+ * of RELEASE_NODE.
+ */
+ return (qcode != ncode) ? NOTIFY_NEXT : RELEASE_NODE;
+ }
+ /*
+ * Accidentally steal the lock, release the lock and
+ * let the queue head get it.
+ */
+ queue_spin_unlock(lock);
+ return NORMAL_EXIT;
+}
+#endif /* queue_code_xchg */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath() *
+ * function. *
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+ u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+ u8 qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+ return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address as well as the queue code
+ * @cpu : CPU number
+ * @qcode : Pointer to queue code value [out]
+ * Return : queue node address & queue code in qcode
+ */
+static inline struct qnode *get_qnode(int cpu, u32 *qcode)
+{
+ struct qnode_set *qset = this_cpu_ptr(&qnset);
+ int qn_idx = qset->node_idx++;
+
+ /*
+ * It should never happen that all the queue nodes are being used.
+ */
+ BUG_ON(qn_idx >= MAX_QNODES);
+ *qcode = queue_encode_qcode(cpu, qn_idx);
+ return qset->nodes + qn_idx;
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static inline void put_qnode(void)
+{
+ this_cpu_dec(qnset.node_idx);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+ unsigned int cpu_nr;
+ struct qnode *node, *next;
+ u32 prev_qcode, my_qcode;
+ enum exitval exitval;
+
+ /*
+ * Get the queue node
+ */
+ cpu_nr = smp_processor_id();
+ node = get_qnode(cpu_nr, &my_qcode);
+
+ /*
+ * Initialize the queue node
+ */
+ node->qhead = false;
+ node->next = NULL;
+
+ /*
+ * The lock may be available at this point, try again if no task was
+ * waiting in the queue.
+ */
+ if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock))
+ goto release_node;
+
+ /*
+ * Exchange current copy of the queue node code
+ */
+ exitval = queue_code_xchg(lock, &prev_qcode, my_qcode);
+ if (unlikely(exitval == NOTIFY_NEXT))
+ goto notify_next;
+ else if (unlikely(exitval == RELEASE_NODE))
+ goto release_node;
+
+ if (prev_qcode) {
+ /*
+ * Not at the queue head, get the address of the previous node
+ * and set up the "next" fields of the that node.
+ */
+ struct qnode *prev = xlate_qcode(prev_qcode);
+
+ ACCESS_ONCE(prev->next) = node;
+ /*
+ * Wait until the queue head flag is on
+ */
+ do {
+ arch_mutex_cpu_relax();
+ } while (!ACCESS_ONCE(node->qhead));
+ }
+
+ /*
+ * At the head of the wait queue now
+ */
+ for (;; arch_mutex_cpu_relax()) {
+ qsval = atomic_read(&lock->qlcode);
+ next = ACCESS_ONCE(node->next);
+ if (qsval & _QLOCK_LOCK_MASK)
+ continue; /* Lock not available yet */
+
+ if (likely(qsval_to_qcode(qsval) != my_qcode)) {
+ /*
+ * There are additional lock waiters in the queue.
+ */
+ if (unlikely(!__queue_spin_trylock(lock)))
+ continue; /* Trylock fails! */
+ if (likely(next))
+ goto set_qhead;
+ else
+ goto notify_next;
+ /*
+ * The queue head is the only lock waiter in the queue.
+ * Get the lock & clear the queue code simultaneously.
+ */
+ } else if (queue_spin_trylock_and_clr_qcode(lock, my_qcode)) {
+ goto release_node;
+ }
+ }
+
+notify_next:
+ /*
+ * Wait, if needed, until the next one in queue set up the next field
+ */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+set_qhead:
+ /*
+ * The next one in queue is now at the head
+ */
+ ACCESS_ONCE(next->qhead) = true;
+
+release_node:
+ put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
--
1.7.1

2014-04-02 13:29:34

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN

This patch adds the necessary KVM specific code to allow XEN to support
the sleeping and CPU kicking operations needed by the queue spinlock PV
code.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/xen/spinlock.c | 119 +++++++++++++++++++++++++++++++++++++++++++++--
kernel/Kconfig.locks | 2 +-
2 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 06f4a64..6bbe798 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,12 @@
#include "xen-ops.h"
#include "debugfs.h"

+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifndef CONFIG_QUEUE_SPINLOCK
+
enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +106,9 @@ struct xen_lock_waiting {
__ticket_t want;
};

-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
static cpumask_t waiting_cpus;

-static bool xen_pvspin = true;
__visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
{
int irq = __this_cpu_read(lock_kicker_irq);
@@ -213,6 +216,94 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
}
}

+#else /* CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_XEN_DEBUG_FS
+static u32 kick_stats; /* CPU kick count */
+static u32 kick_nohalt_stats; /* Kick but not halt count */
+static u32 halt_qhead_stats; /* Queue head halting count */
+static u32 halt_qnode_stats; /* Queue node halting count */
+static u32 wake_kick_stats; /* Wakeup by kicking count */
+static u32 wake_spur_stats; /* Spurious wakeup count */
+
+static inline void xen_kick_stats(void)
+{
+ add_smp(&kick_stats, 1);
+}
+
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+ if (type == PV_HALT_QHEAD)
+ add_smp(&halt_qhead_stats, 1);
+ else /* type == PV_HALT_QNODE */
+ add_smp(&halt_qnode_stats, 1);
+}
+
+static inline void xen_lock_stats(enum pv_lock_stats type)
+{
+ if (type == PV_WAKE_KICKED)
+ add_smp(&wake_kick_stats, 1);
+ else if (type == PV_WAKE_SPURIOUS)
+ add_smp(&wake_spur_stats, 1);
+ else /* type == PV_KICK_NOHALT */
+ add_smp(&kick_nohalt_stats, 1);
+}
+#else /* CONFIG_XEN_DEBUG_FS */
+static inline void xen_kick_stats(void)
+{
+}
+
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+}
+
+static inline void xen_lock_stats(enum pv_lock_stats type)
+{
+}
+#endif /* CONFIG_XEN_DEBUG_FS */
+
+static void xen_kick_cpu(int cpu)
+{
+ xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+ xen_kick_stats();
+}
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+static void xen_hibernate(enum pv_lock_stats type)
+{
+ int irq = __this_cpu_read(lock_kicker_irq);
+ unsigned long flags;
+
+ /* If kicker interrupts not initialized yet, just spin */
+ if (irq == -1)
+ return;
+
+ /*
+ * Make sure an interrupt handler can't upset things in a
+ * partially setup state.
+ */
+ local_irq_save(flags);
+
+ xen_halt_stats(type);
+ /* clear pending */
+ xen_clear_irq_pending(irq);
+
+ /* Allow interrupts while blocked */
+ local_irq_restore(flags);
+
+ /*
+ * If an interrupt happens here, it will leave the wakeup irq
+ * pending, which will cause xen_poll_irq() to return
+ * immediately.
+ */
+
+ /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+ xen_poll_irq(irq);
+}
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
static irqreturn_t dummy_handler(int irq, void *dev_id)
{
BUG();
@@ -258,7 +349,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
}

-
/*
* Our init of PV spinlocks is split in two init functions due to us
* using paravirt patching and jump labels patching and having to do
@@ -275,8 +365,14 @@ void __init xen_init_spinlocks(void)
return;
}

+#ifdef CONFIG_QUEUE_SPINLOCK
+ pv_lock_ops.kick_cpu = xen_kick_cpu;
+ pv_lock_ops.hibernate = xen_hibernate;
+ pv_lock_ops.lockstat = xen_lock_stats;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
+#endif
}

/*
@@ -318,6 +414,7 @@ static int __init xen_spinlock_debugfs(void)

d_spin_debug = debugfs_create_dir("spinlocks", d_xen);

+#ifndef CONFIG_QUEUE_SPINLOCK
debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);

debugfs_create_u32("taken_slow", 0444, d_spin_debug,
@@ -337,7 +434,19 @@ static int __init xen_spinlock_debugfs(void)

debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
-
+#else /* CONFIG_QUEUE_SPINLOCK */
+ debugfs_create_u32("kick_stats", 0644, d_spin_debug, &kick_stats);
+ debugfs_create_u32("kick_nohalt_stats",
+ 0644, d_spin_debug, &kick_nohalt_stats);
+ debugfs_create_u32("halt_qhead_stats",
+ 0644, d_spin_debug, &halt_qhead_stats);
+ debugfs_create_u32("halt_qnode_stats",
+ 0644, d_spin_debug, &halt_qnode_stats);
+ debugfs_create_u32("wake_kick_stats",
+ 0644, d_spin_debug, &wake_kick_stats);
+ debugfs_create_u32("wake_spur_stats",
+ 0644, d_spin_debug, &wake_spur_stats);
+#endif /* CONFIG_QUEUE_SPINLOCK */
return 0;
}
fs_initcall(xen_spinlock_debugfs);
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index a70fdeb..451e392 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK

config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
- depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
+ depends on SMP
--
1.7.1

2014-04-02 13:30:10

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 09/10] pvqspinlock, x86: Enable qspinlock PV support for KVM

This patch adds the necessary KVM specific code to allow KVM to support
the sleeping and CPU kicking operations needed by the queue spinlock PV
code.

Two KVM guests of 20 CPU cores (2 nodes) were created for performance
testing in one of the following three configurations:
1) Only 1 VM is active
2) Both VMs are active and they share the same 20 physical CPUs
(200% overcommit)
3) Both VMs are active and they shares 30 physical CPUs (10 delicated
and 10 shared - 133% overcommit)

The tests run included the disk workload of the AIM7 benchmark
on both ext4 and xfs RAM disks at 3000 users on a 3.14-rc8 based
kernel. A kernel compilation test was also run and the execution
times were noted. With to VMs running, the "idle=poll" kernel option
was added to simulate a busy guest. The entry "unfair + PV qspinlock"
below means that both the unfair lock and PV spinlock configuration
options were turned on.

AIM7 XFS Disk Test (no overcommit)
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
PV ticketlock 2380952 7.56 107.34 5.65
qspinlock 2400000 7.50 105.68 5.68
PV qspinlock 2390438 7.53 102.52 5.48
unfair qspinloc 2432432 7.40 105.30 5.72
unfair + PV qspinlock 2340702 7.69 107.67 5.65

AIM7 XFS Disk Test (133% overcommit)
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
PV ticketlock 1137081 15.83 213.29 13.03
qspinlock 1132075 15.90 221.92 13.92
PV qspinlock 1097561 16.40 229.30 13.72
unfair qspinloc 1138520 15.81 220.13 13.10
unfair + PV qspinlock 1118707 16.09 225.08 13.25

AIM7 XFS Disk Test (200% overcommit)
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
PV ticketlock 577108 31.19 447.10 26.60
qspinlock 117493 153.20 1006.06 59.60
PV qspinlock 568361 31.67 402.69 25.08
unfair qspinloc 604432 29.78 402.20 26.17
unfair + PV qspinlock 629591 28.59 364.56 23.74

AIM7 EXT4 Disk Test (no overcommit)
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
PV ticketlock 1284797 14.01 172.90 5.59
qspinlock 1169591 15.39 177.13 5.62
PV qspinlock 1243953 14.47 179.86 5.34
unfair qspinloc 1474201 12.21 145.08 5.50
unfair + PV qspinlock 1486375 12.11 146.55 5.58

AIM7 EXT4 Disk Test (133% overcommit)
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
PV ticketlock 126130 142.71 2534.69 18.23
qspinlock 119792 150.26 2767.86 24.32
PV qspinlock 116928 153.94 2804.52 20.21
unfair qspinloc 877192 20.52 262.69 10.80
unfair + PV qspinlock 740741 24.30 328.64 12.29

AIM7 EXT4 Disk Test (200% overcommit)
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
PV ticketlock 100880 178.43 3108.33 35.78
qspinlock 54995 327.30 5023.58 54.73
PV qspinlock 104100 172.91 2947.03 33.69
unfair qspinloc 390033 46.15 612.80 27.08
unfair + PV qspinlock 357640 50.33 670.15 29.22

The kernel build test (make -j 20) results are as follows:

(no overcommit)
kernel Real Time Sys Time Usr Time
----- --------- -------- --------
PV ticketlock 8m42.284s 17m2.638s 117m6.862s
qspinlock 8m56.907s 16m34.614s 117m28.756s
PV qspinlock 8m30.477s 16m51.550s 117m28.743s
unfair qspinlock 9m5.152s 16m48.353s 117m50.292s
unfair + PV qspinlock 8m41.729s 16m51.905s 117m20.809s

(133% overcommit)
kernel Real Time Sys Time Usr Time
----- --------- -------- --------
PV ticketlock 13m8.703s 32m14.437s 187m34.016s
qspinlock 13m3.169s 32m9.641s 186m40.085s
PV qspinlock 12m53.279s 32m16.687s 186m32.541s
unfair qspinlock 12m56.707s 31m55.581s 187m45.494s
unfair + PV qspinlock 12m46.688s 32m5.035s 186m15.042s

(200% overcommit)
kernel Real Time Sys Time Usr Time
----- --------- -------- --------
PV ticketlock 20m9.236s 41m35.786s 283m56.333s
qspinlock 26m41.294s 74m55.585s 346m31.981s
PV qspinlock 20m14.312s 41m34.621s 283m50.145s
unfair qspinlock 19m57.384s 40m40.880s 282m54.679s
unfair + PV qspinlock 20m17.564s 41m33.687s 283m1.035s

In term of spinlock contention, the ordering of the 3 workloads are:

kernel build < AIM7 disk xfs < AIM7 disk ext4

With no overcommit, the PV code and unfair lock doesn't differ that
much from the plain qspinlock with the exception of the AIM7 disk
ext4 test which has high spinlock contention.

With 133% overcommit, there were some performance benefit with PV
and unfair lock. With heavy spinlock contention in the ext4 test,
unfair lock performed much better than the rests.

With 200% overcommit, we saw even more benefit with PV and unfair
locks. Again unfair lock provided a much better performance boost
with heavy spinlock contention.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/kernel/kvm.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 2 +-
2 files changed, 112 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8e646a7..7d97e58 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -568,6 +568,7 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}

+#ifndef CONFIG_QUEUE_SPINLOCK
enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -795,6 +796,110 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
}
}
}
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_stats; /* CPU kick count */
+static u32 kick_nohalt_stats; /* Kick but not halt count */
+static u32 halt_qhead_stats; /* Queue head halting count */
+static u32 halt_qnode_stats; /* Queue node halting count */
+static u32 wake_kick_stats; /* Wakeup by kicking count */
+static u32 wake_spur_stats; /* Spurious wakeup count */
+
+static int __init kvm_spinlock_debugfs(void)
+{
+ d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
+ if (!d_kvm_debug) {
+ printk(KERN_WARNING
+ "Could not create 'kvm' debugfs directory\n");
+ return -ENOMEM;
+ }
+ d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
+
+ debugfs_create_u32("kick_stats", 0644, d_spin_debug, &kick_stats);
+ debugfs_create_u32("kick_nohalt_stats",
+ 0644, d_spin_debug, &kick_nohalt_stats);
+ debugfs_create_u32("halt_qhead_stats",
+ 0644, d_spin_debug, &halt_qhead_stats);
+ debugfs_create_u32("halt_qnode_stats",
+ 0644, d_spin_debug, &halt_qnode_stats);
+ debugfs_create_u32("wake_kick_stats",
+ 0644, d_spin_debug, &wake_kick_stats);
+ debugfs_create_u32("wake_spur_stats",
+ 0644, d_spin_debug, &wake_spur_stats);
+ return 0;
+}
+
+static inline void kvm_kick_stats(void)
+{
+ add_smp(&kick_stats, 1);
+}
+
+static inline void kvm_halt_stats(enum pv_lock_stats type)
+{
+ if (type == PV_HALT_QHEAD)
+ add_smp(&halt_qhead_stats, 1);
+ else /* type == PV_HALT_QNODE */
+ add_smp(&halt_qnode_stats, 1);
+}
+
+static inline void kvm_lock_stats(enum pv_lock_stats type)
+{
+ if (type == PV_WAKE_KICKED)
+ add_smp(&wake_kick_stats, 1);
+ else if (type == PV_WAKE_SPURIOUS)
+ add_smp(&wake_spur_stats, 1);
+ else /* type == PV_KICK_NOHALT */
+ add_smp(&kick_nohalt_stats, 1);
+}
+
+fs_initcall(kvm_spinlock_debugfs);
+
+#else /* CONFIG_KVM_DEBUG_FS */
+static inline void kvm_kick_stats(void)
+{
+}
+
+static inline void kvm_halt_stats(enum pv_lock_stats type)
+{
+}
+
+static inline void kvm_lock_stats(enum pv_lock_stats type)
+{
+}
+#endif /* CONFIG_KVM_DEBUG_FS */
+
+static void kvm_kick_cpu_stats(int cpu)
+{
+ kvm_kick_cpu(cpu);
+ kvm_kick_stats();
+}
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+static void kvm_hibernate(enum pv_lock_stats type)
+{
+ unsigned long flags;
+
+ if (in_nmi())
+ return;
+
+ kvm_halt_stats(type);
+ /*
+ * Make sure an interrupt handler can't upset things in a
+ * partially setup state.
+ */
+ local_irq_save(flags);
+ if (arch_irqs_disabled_flags(flags))
+ halt();
+ else
+ safe_halt();
+ local_irq_restore(flags);
+}
+#endif /* !CONFIG_QUEUE_SPINLOCK */

/*
* Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -807,8 +912,14 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;

+#ifdef CONFIG_QUEUE_SPINLOCK
+ pv_lock_ops.kick_cpu = kvm_kick_cpu_stats;
+ pv_lock_ops.hibernate = kvm_hibernate;
+ pv_lock_ops.lockstat = kvm_lock_stats;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
}

static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index f185584..a70fdeb 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK

config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
- depends on SMP && !PARAVIRT_SPINLOCKS
+ depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
--
1.7.1

2014-04-02 13:28:42

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 06/10] pvqspinlock: Enable lock stealing in queue lock waiters

The simple unfair queue lock cannot completely solve the lock waiter
preemption problem as a preempted CPU at the front of the queue will
block forward progress in all the other CPUs behind it in the queue.
To allow those CPUs to move forward, it is necessary to enable lock
stealing for those lock waiters as well.

This patch enables those lock waiters to try to steal the lock
periodically at a frequency that is inverse algorithmatically
proportional to its distance from the queue head until it reaches a
maximum value.

Enabling those lock waiters to try to steal the lock increases the
cacheline pressure on the lock word. As a result, performance can
suffer on a workload with heavy spinlock contention.

The table below shows the the performance (jobs/minutes) of that
scheme when compared with other kernel flavors at 3000 users of the
AIM7 disk workload on a 4-socket Westmere-EX system.

Kernel disk-xfs JPM disk-ext4 JPM
------ ------------ -------------
ticketlock 5,660,377 1,151,631
qspinlock 5,678,233 2,033,898
simple test-and-set 5,678,233 533,966
simple unfair qspinlock 5,732,484 2,216,749
complex unfair qspinlock 5,625,000 707,547

With heavy spinlock contention, the complex unfair lock is faster
than the simple test-and-set lock, but it is still slower than the
baseline ticketlock.

As for the lock starvation problem, the patch tries to reduce the
chance of this problem by enabling the CPUs at the front of the queue
has a much higher chance of getting the lock than the ones behind. This
should greatly improve the chance of making forward progress in the
queue without unacceptably long delay.

The table below shows the execution times (in ms) of a spinlock
micro-benchmark on the same 4-socket Westmere-EX system.

# of Ticket Fair Unfair simple Unfair complex
tasks lock queue lock queue lock queue lock
------ ------- ---------- ---------- --------------
1 135 135 137 137
2 1045 951 732 696
3 1827 2256 915 1267
4 2689 2880 1377 1695
5 3736 3636 1439 2243
6 4942 4294 1724 2617
7 6304 4976 2001 2776
8 7736 5662 2317 2941

Executing one task per node, the performance data were:

# of Ticket Fair Unfair simple Unfair complex
nodes lock queue lock queue lock queue lock
------ ------- ---------- ---------- --------------
1 135 135 137 137
2 4452 1024 1697 1354
3 10767 14030 2015 2581
4 20835 10740 2732 4653

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 270 +++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 265 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cf16bba..527efc3 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -86,13 +86,14 @@ enum exitval {

/*
* The queue node structure
- *
- * This structure is essentially the same as the mcs_spinlock structure
- * in mcs_spinlock.h file. It is retained for future extension where new
- * fields may be added.
*/
struct qnode {
u32 qhead; /* Queue head flag */
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+ int lsteal_mask; /* Lock stealing frequency mask */
+ u32 prev_qcode; /* Queue code of previous node */
+ struct qnode *qprev; /* Previous queue node addr */
+#endif
struct qnode *next; /* Next queue node addr */
};

@@ -107,6 +108,11 @@ struct qnode_set {
static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };

/*
+ * Macro to get queue code from queue spinlock
+ */
+#define queue_get_qcode(lock) qsval_to_qcode(atomic_read(&(lock)->qlcode))
+
+/*
************************************************************************
* The following optimized codes are for architectures that support: *
* 1) Atomic byte and short data write *
@@ -279,6 +285,22 @@ queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
return NORMAL_EXIT;
}

+#define cmpxchg_queue_code cmpxchg_queue_code
+/**
+ * cmpxchg_queue_code - compare and exchange a queue code value in the lock
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code value
+ * @ncode: New queue code value
+ * Return: true if compare and exchange successful, false otherwise
+ */
+static inline int
+cmpxchg_queue_code(struct qspinlock *lock, u32 ocode, u32 ncode)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ return cmpxchg(&qlock->qcode, (u16)ocode, (u16)ncode) == (u16)ocode;
+}
+
#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
/**
* queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
@@ -449,6 +471,223 @@ queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
}
#endif /* queue_code_xchg */

+#ifndef cmpxchg_queue_code
+/**
+ * cmpxchg_queue_code - compare and exchange a queue code value in the lock
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code value
+ * @ncode: New queue code value
+ * Return: true if compare and exchange successful, false otherwise
+ */
+static inline int
+cmpxchg_queue_code(struct qspinlock *lock, u32 ocode, u32 ncode)
+{
+ u32 lockval = atomic_read(lock->qlcode) & _QLOCK_LOCK_MASK;
+
+ ocode |= lockval;
+ ncode |= lockval;
+ return atomic_cmpxchg(&lock->qlcode, ocode, ncode) == ocode;
+}
+#endif /* cmpxchg_queue_code */
+
+/*
+ ************************************************************************
+ * Inline functions for supporting unfair queue lock *
+ ************************************************************************
+ */
+/*
+ * Unfair lock support in a paravirtualized guest
+ *
+ * An unfair lock can be implemented using a simple test-and-set lock like
+ * what is being done in a read-write lock. This simple scheme has 2 major
+ * problems:
+ * 1) It needs constant reading and occasionally writing to the lock word
+ * thus putting a lot of cacheline contention traffic on the affected
+ * cacheline.
+ * 2) Lock starvation is a real possibility especially if the number of
+ * virtual CPUs is large.
+ *
+ * To reduce the undesirable side effects of an unfair lock, the queue
+ * unfair spinlock implements a more elaborate scheme. Lock stealing is
+ * allowed in the following places:
+ * 1) In the spin_lock and spin_trylock functiopns
+ * 2) When spinning in the waiter queue before becoming the queue head
+ *
+ * A lock acquirer has only one chance of stealing the lock in the spin_lock
+ * and spin_trylock function. If the attempt fails for spin_lock, the task
+ * will be queued in the wait queue.
+ *
+ * Even in the wait queue, the task can still attempt to steal the lock
+ * periodically at a frequency about inversely and logarithmically proportional
+ * to its distance from the queue head. In other word, the closer it is to
+ * the queue head, the higher a chance it has of stealing the lock. This
+ * scheme reduce the load on the lock cacheline while trying to maintain
+ * a somewhat FIFO way of getting the lock so as to reduce the chance of lock
+ * starvation.
+ */
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+#define DEF_LOOP_CNT(c) int c = 0
+#define INC_LOOP_CNT(c) (c)++
+#define LOOP_CNT(c) c
+#define LSTEAL_MIN (1 << 3)
+#define LSTEAL_MAX (1 << 10)
+#define LSTEAL_MIN_MASK (LSTEAL_MIN - 1)
+#define LSTEAL_MAX_MASK (LSTEAL_MAX - 1)
+
+/**
+ * unfair_init_vars - initialize unfair relevant fields in queue node structure
+ * @node: Current queue node address
+ */
+static void unfair_init_vars(struct qnode *node)
+{
+ node->qprev = NULL;
+ node->prev_qcode = 0;
+ node->lsteal_mask = LSTEAL_MIN_MASK;
+}
+
+/**
+ * unfair_set_vars - set unfair related fields in the queue node structure
+ * @node : Current queue node address
+ * @prev : Previous queue node address
+ * @prev_qcode: Previous qcode value
+ */
+static void
+unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_qcode)
+{
+ if (!static_key_false(&paravirt_unfairlocks_enabled))
+ return;
+
+ node->qprev = prev;
+ node->prev_qcode = prev_qcode;
+ /*
+ * This node will spin double the number of time of the previous node
+ * before attempting to steal the lock until it reaches a maximum.
+ */
+ node->lsteal_mask = prev->qhead ? LSTEAL_MIN_MASK :
+ (prev->lsteal_mask << 1) + 1;
+ if (node->lsteal_mask > LSTEAL_MAX_MASK)
+ node->lsteal_mask = LSTEAL_MAX_MASK;
+ /* Make sure the new fields are visible to others */
+ smp_wmb();
+}
+
+/**
+ * unfair_check_qcode - check the current to see if it is the last one
+ * and need to be cleared.
+ * @lock : Pointer to queue spinlock structure
+ * @my_qcode: My queue code value to be checked again
+ * Return : Either RELEASE_NODE or NOTIFY_NEXT
+ */
+static enum exitval unfair_check_qcode(struct qspinlock *lock, u32 my_qcode)
+{
+ if (!static_key_false(&paravirt_unfairlocks_enabled))
+ return NOTIFY_NEXT;
+
+ /*
+ * Try to clear the current queue code if it match my_qcode
+ */
+ if ((queue_get_qcode(lock) == my_qcode) &&
+ cmpxchg_queue_code(lock, my_qcode, 0))
+ return RELEASE_NODE;
+ return NOTIFY_NEXT;
+}
+
+/**
+ * unfair_get_lock - try to steal the lock periodically
+ * @lock : Pointer to queue spinlock structure
+ * @node : Current queue node address
+ * @my_qcode: My queue code value
+ * @count : Loop count
+ * Return : NORMAL_EXIT, RELEASE_NODE or NOTIFY_NEXT
+ */
+static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node,
+ u32 my_qcode, int count)
+{
+ u32 pqcode;
+ int qhead;
+ struct qnode *next;
+
+ if (!static_key_false(&paravirt_unfairlocks_enabled) ||
+ ((count & node->lsteal_mask) != node->lsteal_mask))
+ return NORMAL_EXIT;
+
+ if (!queue_spin_trylock_unfair(lock)) {
+ /*
+ * Lock stealing fails, re-adjust the lsteal mask.
+ */
+ struct qnode *prev = node->qprev;
+
+ node->lsteal_mask = prev->qhead ? LSTEAL_MIN_MASK :
+ (prev->lsteal_mask << 1) + 1;
+ if (node->lsteal_mask > LSTEAL_MAX_MASK)
+ node->lsteal_mask = LSTEAL_MAX_MASK;
+ return NORMAL_EXIT;
+ }
+
+ /*
+ * Have stolen the lock, need to remove itself from the wait queue
+ * 1) If it is at the end of the queue, change the qcode in the lock
+ * word to the one before it.
+ * 2) Change the next pointer in the previous queue node to point
+ * to the next one in queue or NULL if it is at the end of queue.
+ * 3) If a next node is present, copy the prev_qcode and qprev values
+ * to the next node.
+ */
+ qhead = ACCESS_ONCE(node->qhead);
+ pqcode = qhead ? 0 : node->prev_qcode;
+
+ if ((queue_get_qcode(lock) == my_qcode) &&
+ cmpxchg_queue_code(lock, my_qcode, pqcode)) {
+ /*
+ * Successfully change the qcode back to the previous one.
+ * Now need to clear the next pointer in the previous node
+ * only if it contains my queue node address. Whether the
+ * cmpxchg() call below fails or succeeds doesn't really
+ * matter.
+ */
+ (void)cmpxchg(&node->qprev->next, node, NULL);
+ return RELEASE_NODE;
+ }
+
+ next = ACCESS_ONCE(node->next);
+ if (unlikely(!next))
+ /* Wait until the next pointer is set */
+ do {
+ arch_mutex_cpu_relax();
+ } while (!(next = ACCESS_ONCE(node->next)));
+
+ if (!qhead) {
+ /*
+ * Change the node data only if it is not the queue head
+ */
+ ACCESS_ONCE(node->qprev->next) = next;
+ ACCESS_ONCE(next->qprev) = node->qprev;
+ ACCESS_ONCE(next->prev_qcode) = node->prev_qcode;
+
+ /*
+ * Make sure all the new node information are visible
+ * before proceeding.
+ */
+ smp_wmb();
+ return RELEASE_NODE;
+ }
+ return NOTIFY_NEXT;
+}
+
+#else /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
+#define DEF_LOOP_CNT(c)
+#define INC_LOOP_CNT(c)
+#define LOOP_CNT(c) 0
+
+static void unfair_init_vars(struct qnode *node) {}
+static void unfair_set_vars(struct qnode *node, struct qnode *prev,
+ u32 prev_qcode) {}
+static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node,
+ u32 my_qcode, int count) { return NORMAL_EXIT; }
+static enum exitval unfair_check_qcode(struct qspinlock *lock, u32 my_qcode)
+ { return NORMAL_EXIT; }
+#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
+
/*
************************************************************************
* Other inline functions needed by the queue_spin_lock_slowpath() *
@@ -525,13 +764,22 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
* and set up the "next" fields of the that node.
*/
struct qnode *prev = xlate_qcode(prev_qcode);
+ DEF_LOOP_CNT(cnt);

+ unfair_set_vars(node, prev, prev_qcode);
ACCESS_ONCE(prev->next) = node;
/*
* Wait until the queue head flag is on
*/
do {
+ INC_LOOP_CNT(cnt);
arch_mutex_cpu_relax();
+ exitval = unfair_get_lock(lock, node, my_qcode,
+ LOOP_CNT(cnt));
+ if (unlikely(exitval == RELEASE_NODE))
+ goto release_node;
+ else if (unlikely(exitval == NOTIFY_NEXT))
+ goto notify_next;
} while (!ACCESS_ONCE(node->qhead));
}

@@ -540,7 +788,6 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
*/
for (;; arch_mutex_cpu_relax()) {
qsval = atomic_read(&lock->qlcode);
- next = ACCESS_ONCE(node->next);
if (qsval & _QLOCK_LOCK_MASK)
continue; /* Lock not available yet */

@@ -550,6 +797,18 @@ static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
*/
if (unlikely(!__queue_spin_trylock(lock)))
continue; /* Trylock fails! */
+
+ next = ACCESS_ONCE(node->next);
+ /*
+ * The unfair lock stealing code may cause the
+ * next node, whose qcode is in the lock, to get
+ * the lock first and thus don't need to be notify.
+ * Need to recheck the qcode value in the lock.
+ */
+ if (unlikely(unfair_check_qcode(lock, my_qcode)
+ == RELEASE_NODE))
+ goto release_node;
+
if (likely(next))
goto set_qhead;
else
@@ -611,6 +870,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
*/
node->qhead = false;
node->next = NULL;
+ unfair_init_vars(node);

/*
* The lock may be available at this point, try again if no task was
--
1.7.1

2014-04-02 13:31:48

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 05/10] pvqspinlock, x86: Allow unfair spinlock in a PV guest

Locking is always an issue in a virtualized environment because of 2
different types of problems:
1) Lock holder preemption
2) Lock waiter preemption

One solution to the lock waiter preemption problem is to allow unfair
lock in a para-virtualized environment. In this case, a new lock
acquirer can come and steal the lock if the next-in-line CPU to get
the lock is scheduled out.

A simple unfair lock is the test-and-set byte lock where an lock
acquirer constantly spins on the lock word and attempt to grab it
when the lock is freed. This simple unfair lock has 2 main problems:
1) The constant spinning on the lock word put a lot of cacheline
contention traffic on the affected cacheline, thus slowing tasks
that need to access the cacheline.
2) Lock starvation is a real possibility especially if the number of
virtual CPUs is large.

A simple unfair queue spinlock can be implemented by allowing lock
stealing in the fast path. The slowpath will still be the same as
before and all the pending lock acquirers will have to wait in the
queue in FIFO order. This cannot completely solve the lock waiter
preemption problem, but it does help to alleviate the impact of
this problem.

To illustrate the performance impact of the various approaches, the
disk workload of the AIM7 benchmark was run on a 4-socket 40-core
Westmere-EX system (bare metal, HT off, ramdisk) on a 3.14-rc5
based kernel. The table below shows the performance (jobs/minutes)
of the different kernel flavors.

Kernel disk-xfs JPM disk-ext4 JPM
------ ------------ -------------
ticketlock 5,660,377 1,151,631
qspinlock 5,678,233 2,033,898
simple test-and-set 5,678,233 533,966
simple unfair qspinlock 5,732,484 2,216,749

The disk-xfs workload spent only about 2.88% of CPU time in
_raw_spin_lock() whereas the disk-ext4 workload spent 57.8% of CPU
time in _raw_spin_lock(). It can be seen that there wasn't too much
difference in performance with low spinlock contention in the disk-xfs
workload. With heavy spinlock contention, the simple test-and-set
lock is only half the performance of the baseline ticketlock. The
simple unfair qspinlock, on the other hand, is almost double the
performance of the ticketlock.

Unfair lock in a native environment is generally not a good idea as
there is a possibility of lock starvation for a heavily contended lock.

This patch adds a new configuration option for the x86 architecture
to enable the use of unfair queue spinlock (PARAVIRT_UNFAIR_LOCKS) in
a para-virtualized guest. A jump label (paravirt_unfairlocks_enabled)
is used to switch between a fair and an unfair version of the spinlock
code. This jump label will only be enabled in a PV guest where the
X86_FEATURE_HYPERVISOR feature bit is set.

Enabling this configuration feature causes a slight decrease the
performance of an uncontended lock-unlock operation by about 1-2%
mainly due to the use of a static key. However, uncontended lock-unlock
operation are really just a tiny percentage of a real workload. So
there should no noticeable change in application performance.

With the unfair locking activated on bare metal 4-socket Westmere-EX
box, the execution times (in ms) of a spinlock micro-benchmark were
as follows:

# of Ticket Fair Unfair simple Unfair
tasks lock queue lock queue lock byte lock
------ ------- ---------- ---------- ---------
1 135 135 137 137
2 1045 951 732 462
3 1827 2256 915 963
4 2689 2880 1377 1706
5 3736 3636 1439 2127
6 4942 4294 1724 2980
7 6304 4976 2001 3491
8 7736 5662 2317 3955

Executing one task per node, the performance data were:

# of Ticket Fair Unfair simple Unfair
nodes lock queue lock queue lock byte lock
------ ------- ---------- ---------- ---------
1 135 135 137 137
2 4452 1024 1697 710
3 10767 14030 2015 1468
4 20835 10740 2732 2582

In general, the shorter the critical section, the better the
performance benefit of an unfair lock. For large critical section,
however, there may not be much benefit.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 11 ++++
arch/x86/include/asm/qspinlock.h | 86 +++++++++++++++++++++++++++++++++-
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/paravirt-spinlocks.c | 26 ++++++++++
4 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index de573f9..010abc4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -629,6 +629,17 @@ config PARAVIRT_SPINLOCKS

If you are unsure how to answer this question, answer Y.

+config PARAVIRT_UNFAIR_LOCKS
+ bool "Enable unfair locks in a para-virtualized guest"
+ depends on PARAVIRT && SMP && QUEUE_SPINLOCK
+ depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE
+ ---help---
+ This changes the kernel to use unfair locks in a
+ para-virtualized guest. This will help performance in most
+ cases. However, there is a possibility of lock starvation
+ on a heavily contended lock especially in a large guest
+ with many virtual CPUs.
+
source "arch/x86/xen/Kconfig"

config KVM_GUEST
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 265b10b..d91994d 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -28,6 +28,10 @@ union arch_qspinlock {
u32 qlcode; /* Complete lock word */
};

+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+extern struct static_key paravirt_unfairlocks_enabled;
+#endif
+
#define queue_spin_unlock queue_spin_unlock
/**
* queue_spin_unlock - release a queue spinlock
@@ -52,15 +56,23 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
/**
* __queue_spin_trylock - acquire the lock by setting the lock bit
* @lock: Pointer to queue spinlock structure
- * Return: Always return 1
+ * Return: 1 if lock acquired, 0 otherwise
*
* This routine should only be called when the caller is the only one
- * entitled to acquire the lock. No lock stealing is allowed.
+ * entitled to acquire the lock.
*/
static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
{
union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+ if (static_key_false(&paravirt_unfairlocks_enabled))
+ /*
+ * Need to use atomic operation to get the lock when
+ * lock stealing can happen.
+ */
+ return cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0;
+#endif
barrier();
ACCESS_ONCE(qlock->lock) = _QLOCK_LOCKED;
barrier();
@@ -71,4 +83,74 @@ static __always_inline int __queue_spin_trylock(struct qspinlock *lock)

#include <asm-generic/qspinlock.h>

+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock unfairly
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ if (likely(cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0))
+ return;
+ /*
+ * Since the lock is now unfair, we should not activate the 2-task
+ * quick spinning code path which disallows lock stealing.
+ */
+ queue_spin_lock_slowpath(lock, -1);
+}
+
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ if (!qlock->lock && (cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+/*
+ * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
+ * jump to the unfair versions if the static key paravirt_unfairlocks_enabled
+ * is true.
+ */
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_lock_flags
+
+/**
+ * arch_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static inline void arch_spin_lock(struct qspinlock *lock)
+{
+ if (static_key_false(&paravirt_unfairlocks_enabled))
+ queue_spin_lock_unfair(lock);
+ else
+ queue_spin_lock(lock);
+}
+
+/**
+ * arch_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int arch_spin_trylock(struct qspinlock *lock)
+{
+ if (static_key_false(&paravirt_unfairlocks_enabled))
+ return queue_spin_trylock_unfair(lock);
+ else
+ return queue_spin_trylock(lock);
+}
+
+#define arch_spin_lock_flags(l, f) arch_spin_lock(l)
+
+#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
+
#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cb648c8..1107a20 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -88,6 +88,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
obj-$(CONFIG_KVM_GUEST) += kvm.o kvmclock.o
obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT_UNFAIR_LOCKS)+= paravirt-spinlocks.o
obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o

obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..7dfd02d 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -8,6 +8,7 @@

#include <asm/paravirt.h>

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP
.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
@@ -18,3 +19,28 @@ EXPORT_SYMBOL(pv_lock_ops);

struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+#endif
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_unfairlocks_enabled);
+
+#include <linux/init.h>
+#include <asm/cpufeature.h>
+
+/*
+ * Enable unfair lock only if it is running under a hypervisor
+ */
+static __init int unfair_locks_init_jump(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return 0;
+
+ static_key_slow_inc(&paravirt_unfairlocks_enabled);
+ printk(KERN_INFO "Unfair spinlock enabled\n");
+
+ return 0;
+}
+early_initcall(unfair_locks_init_jump);
+
+#endif
--
1.7.1

2014-04-02 13:34:18

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 02/10] qspinlock, x86: Enable x86-64 to use queue spinlock

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket. The benefit of queue
spinlock may not be apparent. So queue spinlock is not enabled.

Currently, there is some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/qspinlock.h | 41 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/spinlock.h | 5 ++++
arch/x86/include/asm/spinlock_types.h | 4 +++
4 files changed, 51 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..de573f9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+ select ARCH_USE_QUEUE_SPINLOCK

### Arch settings
config X86
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..44cefee
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,41 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+
+/*
+ * x86-64 specific queue spinlock union structure
+ */
+union arch_qspinlock {
+ struct qspinlock slock;
+ u8 lock; /* Lock bit */
+};
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that the clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+ barrier();
+ ACCESS_ONCE(qlock->lock) = 0;
+ barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..958d20f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS

static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
{
arch_spin_lock(lock);
}
+#endif /* CONFIG_QUEUE_SPINLOCK */

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..7960268 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;

#define TICKET_SHIFT (sizeof(__ticket_t) * 8)

+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
} arch_spinlock_t;

#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */

#include <asm/rwlock.h>

--
1.7.1

2014-04-02 14:33:07

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
> N.B. Sorry for the duplicate. This patch series were resent as the
> original one was rejected by the vger.kernel.org list server
> due to long header. There is no change in content.
>
> v7->v8:
> - Remove one unneeded atomic operation from the slowpath, thus
> improving performance.
> - Simplify some of the codes and add more comments.
> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> unfair lock.
> - Reduce unfair lock slowpath lock stealing frequency depending
> on its distance from the queue head.
> - Add performance data for IvyBridge-EX CPU.

FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
HVM guest under Xen after a while stops working. The workload
is doing 'make -j32' on the Linux kernel.

Completely unresponsive. Thoughts?

(CC ing Marcos who had run the test)
>
> v6->v7:
> - Remove an atomic operation from the 2-task contending code
> - Shorten the names of some macros
> - Make the queue waiter to attempt to steal lock when unfair lock is
> enabled.
> - Remove lock holder kick from the PV code and fix a race condition
> - Run the unfair lock & PV code on overcommitted KVM guests to collect
> performance data.
>
> v5->v6:
> - Change the optimized 2-task contending code to make it fairer at the
> expense of a bit of performance.
> - Add a patch to support unfair queue spinlock for Xen.
> - Modify the PV qspinlock code to follow what was done in the PV
> ticketlock.
> - Add performance data for the unfair lock as well as the PV
> support code.
>
> v4->v5:
> - Move the optimized 2-task contending code to the generic file to
> enable more architectures to use it without code duplication.
> - Address some of the style-related comments by PeterZ.
> - Allow the use of unfair queue spinlock in a real para-virtualized
> execution environment.
> - Add para-virtualization support to the qspinlock code by ensuring
> that the lock holder and queue head stay alive as much as possible.
>
> v3->v4:
> - Remove debugging code and fix a configuration error
> - Simplify the qspinlock structure and streamline the code to make it
> perform a bit better
> - Add an x86 version of asm/qspinlock.h for holding x86 specific
> optimization.
> - Add an optimized x86 code path for 2 contending tasks to improve
> low contention performance.
>
> v2->v3:
> - Simplify the code by using numerous mode only without an unfair option.
> - Use the latest smp_load_acquire()/smp_store_release() barriers.
> - Move the queue spinlock code to kernel/locking.
> - Make the use of queue spinlock the default for x86-64 without user
> configuration.
> - Additional performance tuning.
>
> v1->v2:
> - Add some more comments to document what the code does.
> - Add a numerous CPU mode to support >= 16K CPUs
> - Add a configuration option to allow lock stealing which can further
> improve performance in many cases.
> - Enable wakeup of queue head CPU at unlock time for non-numerous
> CPU mode.
>
> This patch set has 3 different sections:
> 1) Patches 1-4: Introduces a queue-based spinlock implementation that
> can replace the default ticket spinlock without increasing the
> size of the spinlock data structure. As a result, critical kernel
> data structures that embed spinlock won't increase in size and
> break data alignments.
> 2) Patches 5-6: Enables the use of unfair queue spinlock in a
> para-virtualized execution environment. This can resolve some
> of the locking related performance issues due to the fact that
> the next CPU to get the lock may have been scheduled out for a
> period of time.
> 3) Patches 7-10: Enable qspinlock para-virtualization support
> by halting the waiting CPUs after spinning for a certain amount of
> time. The unlock code will detect the a sleeping waiter and wake it
> up. This is essentially the same logic as the PV ticketlock code.
>
> The queue spinlock has slightly better performance than the ticket
> spinlock in uncontended case. Its performance can be much better
> with moderate to heavy contention. This patch has the potential of
> improving the performance of all the workloads that have moderate to
> heavy spinlock contention.
>
> The queue spinlock is especially suitable for NUMA machines with at
> least 2 sockets, though noticeable performance benefit probably won't
> show up in machines with less than 4 sockets.
>
> The purpose of this patch set is not to solve any particular spinlock
> contention problems. Those need to be solved by refactoring the code
> to make more efficient use of the lock or finer granularity ones. The
> main purpose is to make the lock contention problems more tolerable
> until someone can spend the time and effort to fix them.
>
> To illustrate the performance benefit of the queue spinlock, the
> ebizzy benchmark was run with the -m option in two different computers:
>
> Test machine ticket-lock queue-lock
> ------------ ----------- ----------
> 4-socket 40-core 2316 rec/s 2899 rec/s
> Westmere-EX (HT off)
> 2-socket 12-core 2130 rec/s 2176 rec/s
> Westmere-EP (HT on)
>
> Waiman Long (10):
> qspinlock: A generic 4-byte queue spinlock implementation
> qspinlock, x86: Enable x86-64 to use queue spinlock
> qspinlock: More optimized code for smaller NR_CPUS
> qspinlock: Optimized code path for 2 contending tasks
> pvqspinlock, x86: Allow unfair spinlock in a PV guest
> pvqspinlock: Enable lock stealing in queue lock waiters
> pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
> pvqspinlock, x86: Add qspinlock para-virtualization support
> pvqspinlock, x86: Enable qspinlock PV support for KVM
> pvqspinlock, x86: Enable qspinlock PV support for XEN
>
> arch/x86/Kconfig | 12 +
> arch/x86/include/asm/paravirt.h | 17 +-
> arch/x86/include/asm/paravirt_types.h | 16 +
> arch/x86/include/asm/pvqspinlock.h | 260 +++++++++
> arch/x86/include/asm/qspinlock.h | 191 +++++++
> arch/x86/include/asm/spinlock.h | 9 +-
> arch/x86/include/asm/spinlock_types.h | 4 +
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/kvm.c | 113 ++++-
> arch/x86/kernel/paravirt-spinlocks.c | 36 ++-
> arch/x86/xen/spinlock.c | 121 ++++-
> include/asm-generic/qspinlock.h | 126 ++++
> include/asm-generic/qspinlock_types.h | 63 ++
> kernel/Kconfig.locks | 7 +
> kernel/locking/Makefile | 1 +
> kernel/locking/qspinlock.c | 1010 +++++++++++++++++++++++++++++++++
> 16 files changed, 1975 insertions(+), 12 deletions(-)
> create mode 100644 arch/x86/include/asm/pvqspinlock.h
> create mode 100644 arch/x86/include/asm/qspinlock.h
> create mode 100644 include/asm-generic/qspinlock.h
> create mode 100644 include/asm-generic/qspinlock_types.h
> create mode 100644 kernel/locking/qspinlock.c
>

2014-04-02 14:40:26

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN

> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index a70fdeb..451e392 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
>
> config QUEUE_SPINLOCK
> def_bool y if ARCH_USE_QUEUE_SPINLOCK
> - depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
> + depends on SMP

If I read this correctly that means you cannot select any more the old
ticketlocks? As in, if you select CONFIG_PARAVIRT on X86 it will automatically
select ARCH_USE_QUEUE_SPINLOCK which will then enable this by default?

Should the 'def_bool' be selectable?

> --
> 1.7.1
>

2014-04-02 20:35:47

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
>> N.B. Sorry for the duplicate. This patch series were resent as the
>> original one was rejected by the vger.kernel.org list server
>> due to long header. There is no change in content.
>>
>> v7->v8:
>> - Remove one unneeded atomic operation from the slowpath, thus
>> improving performance.
>> - Simplify some of the codes and add more comments.
>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
>> unfair lock.
>> - Reduce unfair lock slowpath lock stealing frequency depending
>> on its distance from the queue head.
>> - Add performance data for IvyBridge-EX CPU.
> FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
> HVM guest under Xen after a while stops working. The workload
> is doing 'make -j32' on the Linux kernel.
>
> Completely unresponsive. Thoughts?
>

Thank for reporting that. I haven't done that much testing on Xen. My
focus was in KVM. I will perform more test on Xen to see if I can
reproduce the problem.

-Longman

2014-04-02 20:39:12

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN

On 04/02/2014 10:39 AM, Konrad Rzeszutek Wilk wrote:
>> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
>> index a70fdeb..451e392 100644
>> --- a/kernel/Kconfig.locks
>> +++ b/kernel/Kconfig.locks
>> @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
>>
>> config QUEUE_SPINLOCK
>> def_bool y if ARCH_USE_QUEUE_SPINLOCK
>> - depends on SMP&& (!PARAVIRT_SPINLOCKS || !XEN)
>> + depends on SMP
> If I read this correctly that means you cannot select any more the old
> ticketlocks? As in, if you select CONFIG_PARAVIRT on X86 it will automatically
> select ARCH_USE_QUEUE_SPINLOCK which will then enable this by default?
>
> Should the 'def_bool' be selectable?

My initial qspinlock patch allows it to be selectable, but there was
comment that doing so would increase the testing matrix and so made it
less desirable. That was why I made it the default for 64-bit code which
are more likely to be used in large machines which benefit the most from
this patch. 32-bit kernel will still use the ticket lock.

-Longman

2014-04-03 02:10:52

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/02/2014 04:35 PM, Waiman Long wrote:
> On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
>> On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
>>> N.B. Sorry for the duplicate. This patch series were resent as the
>>> original one was rejected by the vger.kernel.org list server
>>> due to long header. There is no change in content.
>>>
>>> v7->v8:
>>> - Remove one unneeded atomic operation from the slowpath, thus
>>> improving performance.
>>> - Simplify some of the codes and add more comments.
>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
>>> unfair lock.
>>> - Reduce unfair lock slowpath lock stealing frequency depending
>>> on its distance from the queue head.
>>> - Add performance data for IvyBridge-EX CPU.
>> FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
>> HVM guest under Xen after a while stops working. The workload
>> is doing 'make -j32' on the Linux kernel.
>>
>> Completely unresponsive. Thoughts?
>>
>
> Thank for reporting that. I haven't done that much testing on Xen. My
> focus was in KVM. I will perform more test on Xen to see if I can
> reproduce the problem.
>

BTW, does the halting and sending IPI mechanism work in HVM? I saw that
in RHEL7, PV spinlock was explicitly disabled when in HVM mode. However,
this piece of code isn't in upstream code. So I wonder if there is
problem with that.

-Longman

2014-04-03 17:24:34

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
> On 04/02/2014 04:35 PM, Waiman Long wrote:
> >On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
> >>On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
> >>>N.B. Sorry for the duplicate. This patch series were resent as the
> >>> original one was rejected by the vger.kernel.org list server
> >>> due to long header. There is no change in content.
> >>>
> >>>v7->v8:
> >>> - Remove one unneeded atomic operation from the slowpath, thus
> >>> improving performance.
> >>> - Simplify some of the codes and add more comments.
> >>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> >>> unfair lock.
> >>> - Reduce unfair lock slowpath lock stealing frequency depending
> >>> on its distance from the queue head.
> >>> - Add performance data for IvyBridge-EX CPU.
> >>FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
> >>HVM guest under Xen after a while stops working. The workload
> >>is doing 'make -j32' on the Linux kernel.
> >>
> >>Completely unresponsive. Thoughts?
> >>
> >
> >Thank for reporting that. I haven't done that much testing on Xen.
> >My focus was in KVM. I will perform more test on Xen to see if I
> >can reproduce the problem.
> >
>
> BTW, does the halting and sending IPI mechanism work in HVM? I saw

Yes.
> that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
> However, this piece of code isn't in upstream code. So I wonder if
> there is problem with that.

The PV ticketlock fixed it for HVM. It was disabled before because
the PV guests were using bytelocks while the HVM were using ticketlocks
and you couldnt' swap in PV bytelocks for ticketlocks during startup.

>
> -Longman

2014-04-04 02:57:31

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
>> On 04/02/2014 04:35 PM, Waiman Long wrote:
>>> On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
>>>> On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
>>>>> N.B. Sorry for the duplicate. This patch series were resent as the
>>>>> original one was rejected by the vger.kernel.org list server
>>>>> due to long header. There is no change in content.
>>>>>
>>>>> v7->v8:
>>>>> - Remove one unneeded atomic operation from the slowpath, thus
>>>>> improving performance.
>>>>> - Simplify some of the codes and add more comments.
>>>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
>>>>> unfair lock.
>>>>> - Reduce unfair lock slowpath lock stealing frequency depending
>>>>> on its distance from the queue head.
>>>>> - Add performance data for IvyBridge-EX CPU.
>>>> FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
>>>> HVM guest under Xen after a while stops working. The workload
>>>> is doing 'make -j32' on the Linux kernel.
>>>>
>>>> Completely unresponsive. Thoughts?
>>>>
>>> Thank for reporting that. I haven't done that much testing on Xen.
>>> My focus was in KVM. I will perform more test on Xen to see if I
>>> can reproduce the problem.
>>>
>> BTW, does the halting and sending IPI mechanism work in HVM? I saw
> Yes.
>> that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
>> However, this piece of code isn't in upstream code. So I wonder if
>> there is problem with that.
> The PV ticketlock fixed it for HVM. It was disabled before because
> the PV guests were using bytelocks while the HVM were using ticketlocks
> and you couldnt' swap in PV bytelocks for ticketlocks during startup.

The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
kernel for all configurations. So PV ticketlock as well as Xen and KVM
support was compiled in. I think booting the kernel on bare metal will
cause the Xen code to work in HVM mode thus activating the PV spinlock
code which has a negative impact on performance. That may be why it was
disabled so that the bare metal performance will not be impacted.

BTW, could you send me more information about the configuration of the
machine, like the .config file that you used?

-Longman

2014-04-04 13:00:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation


So I'm just not ever going to pick up this patch; I spend a week trying
to reverse engineer this; I posted a 7 patch series creating the
equivalent, but in a gradual and readable fashion:

http://lkml.kernel.org/r/[email protected]

You keep on ignoring that; I'll keep on ignoring your patches.

I might at some point rewrite some of your pv stuff on top to get this
moving again, but I'm not really motivated to work with you atm.

2014-04-04 14:59:38

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On 04/04/2014 09:00 AM, Peter Zijlstra wrote:
> So I'm just not ever going to pick up this patch; I spend a week trying
> to reverse engineer this; I posted a 7 patch series creating the
> equivalent, but in a gradual and readable fashion:
>
> http://lkml.kernel.org/r/[email protected]
>
> You keep on ignoring that; I'll keep on ignoring your patches.
>
> I might at some point rewrite some of your pv stuff on top to get this
> moving again, but I'm not really motivated to work with you atm.

Peter, I am sorry that I have focused recently on making the qspinlock
patch works with virtualization and it is easier for me to based off on
my patch initially. Now the PV part is almost done, I will apply them on
top of your patch. Hopefully, I will get a new patch out sometime next week.

I am really sorry if you have bad feeling about it. I do not mean to
discredit you on your effort to make the qspinlock patch better. I
really appreciate your input and would like to work with you on this
patch as well as other future patches.

-Longman

2014-04-04 15:26:48

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On Wed, Apr 02, 2014 at 10:32:01AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
> > N.B. Sorry for the duplicate. This patch series were resent as the
> > original one was rejected by the vger.kernel.org list server
> > due to long header. There is no change in content.
> >
> > v7->v8:
> > - Remove one unneeded atomic operation from the slowpath, thus
> > improving performance.
> > - Simplify some of the codes and add more comments.
> > - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> > unfair lock.
> > - Reduce unfair lock slowpath lock stealing frequency depending
> > on its distance from the queue head.
> > - Add performance data for IvyBridge-EX CPU.
>
> FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
> HVM guest under Xen after a while stops working. The workload
> is doing 'make -j32' on the Linux kernel.
>
> Completely unresponsive. Thoughts?

Each VCPU seems to be stuck with this stack trace:

rip: ffffffff810013a8 xen_hypercall_sched_op+0x8
flags: 00000002 nz
rsp: ffff88029f13fb98
rax: 0000000000000000 rcx: 00000000fffffffa rdx: 0000000000000000
rbx: 0000000000000000 rsi: ffff88029f13fba8 rdi: 0000000000000003
rbp: ffff88029f13fbd0 r8: ffff8807ee65a1c0 r9: ffff88080d800b10
r10: 00000000000048cb r11: 0000000000000000 r12: 0000000000000013
r13: 0000000000000004 r14: 0000000000000001 r15: ffffea00076a8cd0
cs: 0010 ss: 0000 ds: 0000 es: 0000
fs: 0000 @ 00002b24c3e7e380
gs: 0000 @ ffff88080e200000/0000000000000000
Code (instr addr ffffffff810013a8)
cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 0f 01 c1 <c3> cc cc cc cc cc cc cc cc cc cc


Stack:
ffffffff81352d9e 000000299f13fbb0 ffff88029f13fba4 ffff880200000001
0000000000000000 ffff88029f13fbd0 0000000000000045 ffff88029f13fbe0
ffffffff81354240 ffff88029f13fc00 ffffffff81012cb6 ffff88080f4da200
ffff88080e214b00 ffff88029f13fc48 ffffffff815e4631 0000000000000000

Call Trace:
[<ffffffff810013a8>] xen_hypercall_sched_op+0x8 <--
[<ffffffff81352d9e>] xen_poll_irq_timeout+0x3e
[<ffffffff81354240>] xen_poll_irq+0x10
[<ffffffff81012cb6>] xen_hibernate+0x46
[<ffffffff815e4631>] queue_spin_lock_slowerpath+0x84
[<ffffffff810ab96e>] queue_spin_lock_slowpath+0xee
[<ffffffff815eff8f>] _raw_spin_lock_irqsave+0x3f
[<ffffffff81144e4d>] pagevec_lru_move_fn+0x8d
[<ffffffff81144780>] __pagevec_lru_add_fn
[<ffffffff81144ed7>] __pagevec_lru_add+0x17
[<ffffffff81145540>] __lru_cache_add+0x60
[<ffffffff8114590e>] lru_cache_add+0xe
[<ffffffff8116d4ba>] page_add_new_anon_rmap+0xda
[<ffffffff81162ab1>] handle_mm_fault+0xaa1
[<ffffffff81169d42>] mmap_region+0x2c2
[<ffffffff815f3c4d>] __do_page_fault+0x18d
[<ffffffff811544e1>] vm_mmap_pgoff+0xb1
[<ffffffff815f3fdb>] do_page_fault+0x2b
[<ffffffff815f06c8>] page_fault+0x28
rip: ffffffff810013a8 xen_hypercall_sched_op+0x8


>
> (CC ing Marcos who had run the test)
> >
> > v6->v7:
> > - Remove an atomic operation from the 2-task contending code
> > - Shorten the names of some macros
> > - Make the queue waiter to attempt to steal lock when unfair lock is
> > enabled.
> > - Remove lock holder kick from the PV code and fix a race condition
> > - Run the unfair lock & PV code on overcommitted KVM guests to collect
> > performance data.
> >
> > v5->v6:
> > - Change the optimized 2-task contending code to make it fairer at the
> > expense of a bit of performance.
> > - Add a patch to support unfair queue spinlock for Xen.
> > - Modify the PV qspinlock code to follow what was done in the PV
> > ticketlock.
> > - Add performance data for the unfair lock as well as the PV
> > support code.
> >
> > v4->v5:
> > - Move the optimized 2-task contending code to the generic file to
> > enable more architectures to use it without code duplication.
> > - Address some of the style-related comments by PeterZ.
> > - Allow the use of unfair queue spinlock in a real para-virtualized
> > execution environment.
> > - Add para-virtualization support to the qspinlock code by ensuring
> > that the lock holder and queue head stay alive as much as possible.
> >
> > v3->v4:
> > - Remove debugging code and fix a configuration error
> > - Simplify the qspinlock structure and streamline the code to make it
> > perform a bit better
> > - Add an x86 version of asm/qspinlock.h for holding x86 specific
> > optimization.
> > - Add an optimized x86 code path for 2 contending tasks to improve
> > low contention performance.
> >
> > v2->v3:
> > - Simplify the code by using numerous mode only without an unfair option.
> > - Use the latest smp_load_acquire()/smp_store_release() barriers.
> > - Move the queue spinlock code to kernel/locking.
> > - Make the use of queue spinlock the default for x86-64 without user
> > configuration.
> > - Additional performance tuning.
> >
> > v1->v2:
> > - Add some more comments to document what the code does.
> > - Add a numerous CPU mode to support >= 16K CPUs
> > - Add a configuration option to allow lock stealing which can further
> > improve performance in many cases.
> > - Enable wakeup of queue head CPU at unlock time for non-numerous
> > CPU mode.
> >
> > This patch set has 3 different sections:
> > 1) Patches 1-4: Introduces a queue-based spinlock implementation that
> > can replace the default ticket spinlock without increasing the
> > size of the spinlock data structure. As a result, critical kernel
> > data structures that embed spinlock won't increase in size and
> > break data alignments.
> > 2) Patches 5-6: Enables the use of unfair queue spinlock in a
> > para-virtualized execution environment. This can resolve some
> > of the locking related performance issues due to the fact that
> > the next CPU to get the lock may have been scheduled out for a
> > period of time.
> > 3) Patches 7-10: Enable qspinlock para-virtualization support
> > by halting the waiting CPUs after spinning for a certain amount of
> > time. The unlock code will detect the a sleeping waiter and wake it
> > up. This is essentially the same logic as the PV ticketlock code.
> >
> > The queue spinlock has slightly better performance than the ticket
> > spinlock in uncontended case. Its performance can be much better
> > with moderate to heavy contention. This patch has the potential of
> > improving the performance of all the workloads that have moderate to
> > heavy spinlock contention.
> >
> > The queue spinlock is especially suitable for NUMA machines with at
> > least 2 sockets, though noticeable performance benefit probably won't
> > show up in machines with less than 4 sockets.
> >
> > The purpose of this patch set is not to solve any particular spinlock
> > contention problems. Those need to be solved by refactoring the code
> > to make more efficient use of the lock or finer granularity ones. The
> > main purpose is to make the lock contention problems more tolerable
> > until someone can spend the time and effort to fix them.
> >
> > To illustrate the performance benefit of the queue spinlock, the
> > ebizzy benchmark was run with the -m option in two different computers:
> >
> > Test machine ticket-lock queue-lock
> > ------------ ----------- ----------
> > 4-socket 40-core 2316 rec/s 2899 rec/s
> > Westmere-EX (HT off)
> > 2-socket 12-core 2130 rec/s 2176 rec/s
> > Westmere-EP (HT on)
> >
> > Waiman Long (10):
> > qspinlock: A generic 4-byte queue spinlock implementation
> > qspinlock, x86: Enable x86-64 to use queue spinlock
> > qspinlock: More optimized code for smaller NR_CPUS
> > qspinlock: Optimized code path for 2 contending tasks
> > pvqspinlock, x86: Allow unfair spinlock in a PV guest
> > pvqspinlock: Enable lock stealing in queue lock waiters
> > pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
> > pvqspinlock, x86: Add qspinlock para-virtualization support
> > pvqspinlock, x86: Enable qspinlock PV support for KVM
> > pvqspinlock, x86: Enable qspinlock PV support for XEN
> >
> > arch/x86/Kconfig | 12 +
> > arch/x86/include/asm/paravirt.h | 17 +-
> > arch/x86/include/asm/paravirt_types.h | 16 +
> > arch/x86/include/asm/pvqspinlock.h | 260 +++++++++
> > arch/x86/include/asm/qspinlock.h | 191 +++++++
> > arch/x86/include/asm/spinlock.h | 9 +-
> > arch/x86/include/asm/spinlock_types.h | 4 +
> > arch/x86/kernel/Makefile | 1 +
> > arch/x86/kernel/kvm.c | 113 ++++-
> > arch/x86/kernel/paravirt-spinlocks.c | 36 ++-
> > arch/x86/xen/spinlock.c | 121 ++++-
> > include/asm-generic/qspinlock.h | 126 ++++
> > include/asm-generic/qspinlock_types.h | 63 ++
> > kernel/Kconfig.locks | 7 +
> > kernel/locking/Makefile | 1 +
> > kernel/locking/qspinlock.c | 1010 +++++++++++++++++++++++++++++++++
> > 16 files changed, 1975 insertions(+), 12 deletions(-)
> > create mode 100644 arch/x86/include/asm/pvqspinlock.h
> > create mode 100644 arch/x86/include/asm/qspinlock.h
> > create mode 100644 include/asm-generic/qspinlock.h
> > create mode 100644 include/asm-generic/qspinlock_types.h
> > create mode 100644 kernel/locking/qspinlock.c
> >

2014-04-04 16:56:20

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
> On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
> >On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
> >>On 04/02/2014 04:35 PM, Waiman Long wrote:
> >>>On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
> >>>>On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
> >>>>>N.B. Sorry for the duplicate. This patch series were resent as the
> >>>>> original one was rejected by the vger.kernel.org list server
> >>>>> due to long header. There is no change in content.
> >>>>>
> >>>>>v7->v8:
> >>>>> - Remove one unneeded atomic operation from the slowpath, thus
> >>>>> improving performance.
> >>>>> - Simplify some of the codes and add more comments.
> >>>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> >>>>> unfair lock.
> >>>>> - Reduce unfair lock slowpath lock stealing frequency depending
> >>>>> on its distance from the queue head.
> >>>>> - Add performance data for IvyBridge-EX CPU.
> >>>>FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
> >>>>HVM guest under Xen after a while stops working. The workload
> >>>>is doing 'make -j32' on the Linux kernel.
> >>>>
> >>>>Completely unresponsive. Thoughts?
> >>>>
> >>>Thank for reporting that. I haven't done that much testing on Xen.
> >>>My focus was in KVM. I will perform more test on Xen to see if I
> >>>can reproduce the problem.
> >>>
> >>BTW, does the halting and sending IPI mechanism work in HVM? I saw
> >Yes.
> >>that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
> >>However, this piece of code isn't in upstream code. So I wonder if
> >>there is problem with that.
> >The PV ticketlock fixed it for HVM. It was disabled before because
> >the PV guests were using bytelocks while the HVM were using ticketlocks
> >and you couldnt' swap in PV bytelocks for ticketlocks during startup.
>
> The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
> kernel for all configurations. So PV ticketlock as well as Xen and
> KVM support was compiled in. I think booting the kernel on bare
> metal will cause the Xen code to work in HVM mode thus activating
> the PV spinlock code which has a negative impact on performance.

Huh? -EPARSE

> That may be why it was disabled so that the bare metal performance
> will not be impacted.

I am not following you.
>
> BTW, could you send me more information about the configuration of
> the machine, like the .config file that you used?

Marcos, could you please send that information to Peter. Thanks!
>
> -Longman

2014-04-04 16:58:16

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
>
> So I'm just not ever going to pick up this patch; I spend a week trying
> to reverse engineer this; I posted a 7 patch series creating the
> equivalent, but in a gradual and readable fashion:
>
> http://lkml.kernel.org/r/[email protected]
>
> You keep on ignoring that; I'll keep on ignoring your patches.
>
> I might at some point rewrite some of your pv stuff on top to get this
> moving again, but I'm not really motivated to work with you atm.

Uh? Did you CC also [email protected] on your patches Peter?
I hadn't had a chance to see or comment on them :-(

2014-04-04 17:08:44

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On 04/04/2014 12:57 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
>> So I'm just not ever going to pick up this patch; I spend a week trying
>> to reverse engineer this; I posted a 7 patch series creating the
>> equivalent, but in a gradual and readable fashion:
>>
>> http://lkml.kernel.org/r/[email protected]
>>
>> You keep on ignoring that; I'll keep on ignoring your patches.
>>
>> I might at some point rewrite some of your pv stuff on top to get this
>> moving again, but I'm not really motivated to work with you atm.
> Uh? Did you CC also [email protected] on your patches Peter?
> I hadn't had a chance to see or comment on them :-(
>

Peter's patch is a rewrite of my patches 1-4, there is no PV or unfair
lock support in there.

-Longman

2014-04-04 17:13:48

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
>> On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
>>>> On 04/02/2014 04:35 PM, Waiman Long wrote:
>>>>> On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
>>>>>>> N.B. Sorry for the duplicate. This patch series were resent as the
>>>>>>> original one was rejected by the vger.kernel.org list server
>>>>>>> due to long header. There is no change in content.
>>>>>>>
>>>>>>> v7->v8:
>>>>>>> - Remove one unneeded atomic operation from the slowpath, thus
>>>>>>> improving performance.
>>>>>>> - Simplify some of the codes and add more comments.
>>>>>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
>>>>>>> unfair lock.
>>>>>>> - Reduce unfair lock slowpath lock stealing frequency depending
>>>>>>> on its distance from the queue head.
>>>>>>> - Add performance data for IvyBridge-EX CPU.
>>>>>> FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
>>>>>> HVM guest under Xen after a while stops working. The workload
>>>>>> is doing 'make -j32' on the Linux kernel.
>>>>>>
>>>>>> Completely unresponsive. Thoughts?
>>>>>>
>>>>> Thank for reporting that. I haven't done that much testing on Xen.
>>>>> My focus was in KVM. I will perform more test on Xen to see if I
>>>>> can reproduce the problem.
>>>>>
>>>> BTW, does the halting and sending IPI mechanism work in HVM? I saw
>>> Yes.
>>>> that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
>>>> However, this piece of code isn't in upstream code. So I wonder if
>>>> there is problem with that.
>>> The PV ticketlock fixed it for HVM. It was disabled before because
>>> the PV guests were using bytelocks while the HVM were using ticketlocks
>>> and you couldnt' swap in PV bytelocks for ticketlocks during startup.
>> The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
>> kernel for all configurations. So PV ticketlock as well as Xen and
>> KVM support was compiled in. I think booting the kernel on bare
>> metal will cause the Xen code to work in HVM mode thus activating
>> the PV spinlock code which has a negative impact on performance.
> Huh? -EPARSE
>
>> That may be why it was disabled so that the bare metal performance
>> will not be impacted.
> I am not following you.

What I am saying is that when XEN and PV spinlock is compiled into the
current upstream kernel, the PV spinlock jump label is turned on when
booted on bare metal. In other words, the PV spinlock code is active
even when they are not needed and actually slow thing down in that
situation. This is a problem and we need to find way to make sure that
the PV spinlock code won't be activated on bare metal.

-Longman

2014-04-04 17:53:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation


* Waiman Long <[email protected]> wrote:

> On 04/04/2014 09:00 AM, Peter Zijlstra wrote:
> >
> > So I'm just not ever going to pick up this patch; I spend a week
> > trying to reverse engineer this; I posted a 7 patch series
> > creating the equivalent, but in a gradual and readable fashion:
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > You keep on ignoring that; I'll keep on ignoring your patches.
> >
> > I might at some point rewrite some of your pv stuff on top to get
> > this moving again, but I'm not really motivated to work with you
> > atm.
>
> Peter, I am sorry that I have focused recently on making the
> qspinlock patch works with virtualization and it is easier for me to
> based off on my patch initially. Now the PV part is almost done, I
> will apply them on top of your patch. Hopefully, I will get a new
> patch out sometime next week.

Note that it's not "a patch" that PeterZ posted, but a series of 7
finegrained patches, each properly documented and commented. Please
preserve that work, build on top of it, and don't just ignore it!

Thanks,

Ingo

2014-04-04 17:55:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation


* Waiman Long <[email protected]> wrote:

> On 04/04/2014 12:57 PM, Konrad Rzeszutek Wilk wrote:
> >On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
> >>So I'm just not ever going to pick up this patch; I spend a week trying
> >>to reverse engineer this; I posted a 7 patch series creating the
> >>equivalent, but in a gradual and readable fashion:
> >>
> >> http://lkml.kernel.org/r/[email protected]
> >>
> >>You keep on ignoring that; I'll keep on ignoring your patches.
> >>
> >>I might at some point rewrite some of your pv stuff on top to get this
> >>moving again, but I'm not really motivated to work with you atm.
> >Uh? Did you CC also [email protected] on your patches Peter?
> >I hadn't had a chance to see or comment on them :-(
> >
>
> Peter's patch is a rewrite of my patches 1-4, there is no PV or
> unfair lock support in there.

It is a fine grained split-up, which does one thing at a time, so it
all becomes reviewable and mergable (and the claimed effects become
testable!). Please use that as a base.

Thanks,

Ingo

2014-04-04 17:59:14

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote:
> On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
> >On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
> >>On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
> >>>On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
> >>>>On 04/02/2014 04:35 PM, Waiman Long wrote:
> >>>>>On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
> >>>>>>On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
> >>>>>>>N.B. Sorry for the duplicate. This patch series were resent as the
> >>>>>>> original one was rejected by the vger.kernel.org list server
> >>>>>>> due to long header. There is no change in content.
> >>>>>>>
> >>>>>>>v7->v8:
> >>>>>>> - Remove one unneeded atomic operation from the slowpath, thus
> >>>>>>> improving performance.
> >>>>>>> - Simplify some of the codes and add more comments.
> >>>>>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> >>>>>>> unfair lock.
> >>>>>>> - Reduce unfair lock slowpath lock stealing frequency depending
> >>>>>>> on its distance from the queue head.
> >>>>>>> - Add performance data for IvyBridge-EX CPU.
> >>>>>>FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
> >>>>>>HVM guest under Xen after a while stops working. The workload
> >>>>>>is doing 'make -j32' on the Linux kernel.
> >>>>>>
> >>>>>>Completely unresponsive. Thoughts?
> >>>>>>
> >>>>>Thank for reporting that. I haven't done that much testing on Xen.
> >>>>>My focus was in KVM. I will perform more test on Xen to see if I
> >>>>>can reproduce the problem.
> >>>>>
> >>>>BTW, does the halting and sending IPI mechanism work in HVM? I saw
> >>>Yes.
> >>>>that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
> >>>>However, this piece of code isn't in upstream code. So I wonder if
> >>>>there is problem with that.
> >>>The PV ticketlock fixed it for HVM. It was disabled before because
> >>>the PV guests were using bytelocks while the HVM were using ticketlocks
> >>>and you couldnt' swap in PV bytelocks for ticketlocks during startup.
> >>The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
> >>kernel for all configurations. So PV ticketlock as well as Xen and
> >>KVM support was compiled in. I think booting the kernel on bare
> >>metal will cause the Xen code to work in HVM mode thus activating
> >>the PV spinlock code which has a negative impact on performance.
> >Huh? -EPARSE
> >
> >>That may be why it was disabled so that the bare metal performance
> >>will not be impacted.
> >I am not following you.
>
> What I am saying is that when XEN and PV spinlock is compiled into
> the current upstream kernel, the PV spinlock jump label is turned on
> when booted on bare metal. In other words, the PV spinlock code is

How does it turn it on? I see that the jump lables are only turned
on when the jump label is enable when it detects that it is running
under Xen or KVM. It won't turn it on under baremetal.

> active even when they are not needed and actually slow thing down in
> that situation. This is a problem and we need to find way to make
> sure that the PV spinlock code won't be activated on bare metal.

Could you explain to me which piece of code enables the jump labels
on baremetal please?
>
> -Longman

2014-04-04 18:15:58

by Marcos E. Matsunaga

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

Attached is the .config I used.

On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
>> On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
>>>> On 04/02/2014 04:35 PM, Waiman Long wrote:
>>>>> On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
>>>>>>> N.B. Sorry for the duplicate. This patch series were resent as the
>>>>>>> original one was rejected by the vger.kernel.org list server
>>>>>>> due to long header. There is no change in content.
>>>>>>>
>>>>>>> v7->v8:
>>>>>>> - Remove one unneeded atomic operation from the slowpath, thus
>>>>>>> improving performance.
>>>>>>> - Simplify some of the codes and add more comments.
>>>>>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
>>>>>>> unfair lock.
>>>>>>> - Reduce unfair lock slowpath lock stealing frequency depending
>>>>>>> on its distance from the queue head.
>>>>>>> - Add performance data for IvyBridge-EX CPU.
>>>>>> FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
>>>>>> HVM guest under Xen after a while stops working. The workload
>>>>>> is doing 'make -j32' on the Linux kernel.
>>>>>>
>>>>>> Completely unresponsive. Thoughts?
>>>>>>
>>>>> Thank for reporting that. I haven't done that much testing on Xen.
>>>>> My focus was in KVM. I will perform more test on Xen to see if I
>>>>> can reproduce the problem.
>>>>>
>>>> BTW, does the halting and sending IPI mechanism work in HVM? I saw
>>> Yes.
>>>> that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
>>>> However, this piece of code isn't in upstream code. So I wonder if
>>>> there is problem with that.
>>> The PV ticketlock fixed it for HVM. It was disabled before because
>>> the PV guests were using bytelocks while the HVM were using ticketlocks
>>> and you couldnt' swap in PV bytelocks for ticketlocks during startup.
>> The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
>> kernel for all configurations. So PV ticketlock as well as Xen and
>> KVM support was compiled in. I think booting the kernel on bare
>> metal will cause the Xen code to work in HVM mode thus activating
>> the PV spinlock code which has a negative impact on performance.
> Huh? -EPARSE
>
>> That may be why it was disabled so that the bare metal performance
>> will not be impacted.
> I am not following you.
>> BTW, could you send me more information about the configuration of
>> the machine, like the .config file that you used?
> Marcos, could you please send that information to Peter. Thanks!
>> -Longman

--

Regards,

Marcos Eduardo Matsunaga

Oracle USA
Linux Engineering

?The statements and opinions expressed here are my own and do not
necessarily represent those of Oracle Corporation.?


Attachments:
config-3.14.0-qspinlockV8+ (123.00 kB)

2014-04-04 18:33:56

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On Fri, Apr 04, 2014 at 01:58:15PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote:
> > On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
> > >On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
> > >>On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
> > >>>On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
> > >>>>On 04/02/2014 04:35 PM, Waiman Long wrote:
> > >>>>>On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
> > >>>>>>On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
> > >>>>>>>N.B. Sorry for the duplicate. This patch series were resent as the
> > >>>>>>> original one was rejected by the vger.kernel.org list server
> > >>>>>>> due to long header. There is no change in content.
> > >>>>>>>
> > >>>>>>>v7->v8:
> > >>>>>>> - Remove one unneeded atomic operation from the slowpath, thus
> > >>>>>>> improving performance.
> > >>>>>>> - Simplify some of the codes and add more comments.
> > >>>>>>> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> > >>>>>>> unfair lock.
> > >>>>>>> - Reduce unfair lock slowpath lock stealing frequency depending
> > >>>>>>> on its distance from the queue head.
> > >>>>>>> - Add performance data for IvyBridge-EX CPU.
> > >>>>>>FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
> > >>>>>>HVM guest under Xen after a while stops working. The workload
> > >>>>>>is doing 'make -j32' on the Linux kernel.
> > >>>>>>
> > >>>>>>Completely unresponsive. Thoughts?
> > >>>>>>
> > >>>>>Thank for reporting that. I haven't done that much testing on Xen.
> > >>>>>My focus was in KVM. I will perform more test on Xen to see if I
> > >>>>>can reproduce the problem.
> > >>>>>
> > >>>>BTW, does the halting and sending IPI mechanism work in HVM? I saw
> > >>>Yes.
> > >>>>that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
> > >>>>However, this piece of code isn't in upstream code. So I wonder if
> > >>>>there is problem with that.
> > >>>The PV ticketlock fixed it for HVM. It was disabled before because
> > >>>the PV guests were using bytelocks while the HVM were using ticketlocks
> > >>>and you couldnt' swap in PV bytelocks for ticketlocks during startup.
> > >>The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
> > >>kernel for all configurations. So PV ticketlock as well as Xen and
> > >>KVM support was compiled in. I think booting the kernel on bare
> > >>metal will cause the Xen code to work in HVM mode thus activating
> > >>the PV spinlock code which has a negative impact on performance.
> > >Huh? -EPARSE
> > >
> > >>That may be why it was disabled so that the bare metal performance
> > >>will not be impacted.
> > >I am not following you.
> >
> > What I am saying is that when XEN and PV spinlock is compiled into
> > the current upstream kernel, the PV spinlock jump label is turned on
> > when booted on bare metal. In other words, the PV spinlock code is
>
> How does it turn it on? I see that the jump lables are only turned
> on when the jump label is enable when it detects that it is running
> under Xen or KVM. It won't turn it on under baremetal.

Well, it seems that it does turn it on baremetal which is an stupid mistake.

Sending a patch shortly.
>
> > active even when they are not needed and actually slow thing down in
> > that situation. This is a problem and we need to find way to make
> > sure that the PV spinlock code won't be activated on bare metal.
>
> Could you explain to me which piece of code enables the jump labels
> on baremetal please?
> >
> > -Longman

2014-04-07 06:09:59

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/02/2014 06:57 PM, Waiman Long wrote:
> N.B. Sorry for the duplicate. This patch series were resent as the
> original one was rejected by the vger.kernel.org list server
> due to long header. There is no change in content.
>
> v7->v8:
> - Remove one unneeded atomic operation from the slowpath, thus
> improving performance.
> - Simplify some of the codes and add more comments.
> - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
> unfair lock.
> - Reduce unfair lock slowpath lock stealing frequency depending
> on its distance from the queue head.
> - Add performance data for IvyBridge-EX CPU.
>
> v6->v7:
> - Remove an atomic operation from the 2-task contending code
> - Shorten the names of some macros
> - Make the queue waiter to attempt to steal lock when unfair lock is
> enabled.
> - Remove lock holder kick from the PV code and fix a race condition
> - Run the unfair lock & PV code on overcommitted KVM guests to collect
> performance data.
>
> v5->v6:
> - Change the optimized 2-task contending code to make it fairer at the
> expense of a bit of performance.
> - Add a patch to support unfair queue spinlock for Xen.
> - Modify the PV qspinlock code to follow what was done in the PV
> ticketlock.
> - Add performance data for the unfair lock as well as the PV
> support code.
>
> v4->v5:
> - Move the optimized 2-task contending code to the generic file to
> enable more architectures to use it without code duplication.
> - Address some of the style-related comments by PeterZ.
> - Allow the use of unfair queue spinlock in a real para-virtualized
> execution environment.
> - Add para-virtualization support to the qspinlock code by ensuring
> that the lock holder and queue head stay alive as much as possible.
>
> v3->v4:
> - Remove debugging code and fix a configuration error
> - Simplify the qspinlock structure and streamline the code to make it
> perform a bit better
> - Add an x86 version of asm/qspinlock.h for holding x86 specific
> optimization.
> - Add an optimized x86 code path for 2 contending tasks to improve
> low contention performance.
>
> v2->v3:
> - Simplify the code by using numerous mode only without an unfair option.
> - Use the latest smp_load_acquire()/smp_store_release() barriers.
> - Move the queue spinlock code to kernel/locking.
> - Make the use of queue spinlock the default for x86-64 without user
> configuration.
> - Additional performance tuning.
>
> v1->v2:
> - Add some more comments to document what the code does.
> - Add a numerous CPU mode to support >= 16K CPUs
> - Add a configuration option to allow lock stealing which can further
> improve performance in many cases.
> - Enable wakeup of queue head CPU at unlock time for non-numerous
> CPU mode.
>
> This patch set has 3 different sections:
> 1) Patches 1-4: Introduces a queue-based spinlock implementation that
> can replace the default ticket spinlock without increasing the
> size of the spinlock data structure. As a result, critical kernel
> data structures that embed spinlock won't increase in size and
> break data alignments.
> 2) Patches 5-6: Enables the use of unfair queue spinlock in a
> para-virtualized execution environment. This can resolve some
> of the locking related performance issues due to the fact that
> the next CPU to get the lock may have been scheduled out for a
> period of time.
> 3) Patches 7-10: Enable qspinlock para-virtualization support
> by halting the waiting CPUs after spinning for a certain amount of
> time. The unlock code will detect the a sleeping waiter and wake it
> up. This is essentially the same logic as the PV ticketlock code.
>
> The queue spinlock has slightly better performance than the ticket
> spinlock in uncontended case. Its performance can be much better
> with moderate to heavy contention. This patch has the potential of
> improving the performance of all the workloads that have moderate to
> heavy spinlock contention.
>
> The queue spinlock is especially suitable for NUMA machines with at
> least 2 sockets, though noticeable performance benefit probably won't
> show up in machines with less than 4 sockets.
>
> The purpose of this patch set is not to solve any particular spinlock
> contention problems. Those need to be solved by refactoring the code
> to make more efficient use of the lock or finer granularity ones. The
> main purpose is to make the lock contention problems more tolerable
> until someone can spend the time and effort to fix them.
>
> To illustrate the performance benefit of the queue spinlock, the
> ebizzy benchmark was run with the -m option in two different computers:
>
> Test machine ticket-lock queue-lock
> ------------ ----------- ----------
> 4-socket 40-core 2316 rec/s 2899 rec/s
> Westmere-EX (HT off)
> 2-socket 12-core 2130 rec/s 2176 rec/s
> Westmere-EP (HT on)
>

I tested the v7,v8 of qspinlock with unfair config on kvm guest.
I was curious about unfair locks performance in undercommit cases.
(overcommit case is expected to perform well)

But I am seeing hang in overcommit cases. Gdb showed that many vcpus
are halted and there was no progress. Suspecting the problem /race with
halting, I removed the halt() part of kvm_hibernate(). I am yet to take
a closer look at the code on halt() related changes.

Patch series with that change gave around 20% improvement for dbench 2x
and 30% improvement for ebizzy 2x cases. (1x has no significant loss/gain).

2014-04-07 14:10:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On Fri, Apr 04, 2014 at 01:08:16PM -0400, Waiman Long wrote:
> Peter's patch is a rewrite of my patches 1-4, there is no PV or unfair lock
> support in there.

Yes, because your patches were unreadable and entirely non obvious.

And while I appreciate that its not entirely your fault; the subject is
hard, you didn't even try to make it better and explain things in a
normal gradual fashion.

So what I did was start with a 'simple' correct implementation (although
I could have started simpler still I suppose and added 3-4 more patches)
and then added each optimization on top and explained the what and why
for them.

The result is similar code, but the path is ever so much easier to
understand and review.

And no, it doesn't have PV support; I spend the whole week trying to
reverse engineer your patch 1; whereas if you'd presented it in the form
I posted I might have agreed in a day or so.

I still have to look at the PV bits; but seeing how you have shown no
interest in writing coherent and understandable patch sets I'm less
inclined to go stare at them for another few weeks and rewrite that code
too.

Also; theres a few subtle but important differences between the patch
sets. Your series only makes x86_64 use the qspinlocks; the very last we
want is i386 and x86_64 to use different spinlock implementations; we
want less differences between them, not more.

You stuff a whole lot of code into arch/x86 for no reason what so ever.

Prior to that; I also rewrote your benchmark thing; using jiffies for
timing is just vile. Not to mention you require a full kernel build and
reboot (which is just stupid slow on big machines) to test anything.

2014-04-07 14:13:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On Fri, Apr 04, 2014 at 12:57:27PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
> >
> > So I'm just not ever going to pick up this patch; I spend a week trying
> > to reverse engineer this; I posted a 7 patch series creating the
> > equivalent, but in a gradual and readable fashion:
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > You keep on ignoring that; I'll keep on ignoring your patches.
> >
> > I might at some point rewrite some of your pv stuff on top to get this
> > moving again, but I'm not really motivated to work with you atm.
>
> Uh? Did you CC also [email protected] on your patches Peter?
> I hadn't had a chance to see or comment on them :-(

No of course not :-)

Also as noted elsewhere, I didn't actually do any PV muck yet. I spend
the time trying to get my head around patch 1; all the while Waiman kept
piling more and more on top.

2014-04-07 14:16:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On Fri, Apr 04, 2014 at 10:59:09AM -0400, Waiman Long wrote:
> I am really sorry if you have bad feeling about it. I do not mean to
> discredit you on your effort to make the qspinlock patch better. I really
> appreciate your input and would like to work with you on this patch as well
> as other future patches.

OK.. that'd be great. Sorry if I sound somewhat cranky (well I am, of
course) partly this is because jet-lag and partly because I just get
grumpy from the Inbox after travel.

2014-04-07 14:34:06

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On Mon, Apr 07, 2014 at 04:12:58PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 04, 2014 at 12:57:27PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
> > >
> > > So I'm just not ever going to pick up this patch; I spend a week trying
> > > to reverse engineer this; I posted a 7 patch series creating the
> > > equivalent, but in a gradual and readable fashion:
> > >
> > > http://lkml.kernel.org/r/[email protected]
> > >
> > > You keep on ignoring that; I'll keep on ignoring your patches.
> > >
> > > I might at some point rewrite some of your pv stuff on top to get this
> > > moving again, but I'm not really motivated to work with you atm.
> >
> > Uh? Did you CC also [email protected] on your patches Peter?
> > I hadn't had a chance to see or comment on them :-(
>
> No of course not :-)
>
> Also as noted elsewhere, I didn't actually do any PV muck yet. I spend
> the time trying to get my head around patch 1; all the while Waiman kept
> piling more and more on top.

Ah, I see. Looking forward to seeing your 'muck' code then :-)

Thanks!

2014-04-07 16:38:40

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/07/2014 02:14 AM, Raghavendra K T wrote:
>
>
> I tested the v7,v8 of qspinlock with unfair config on kvm guest.
> I was curious about unfair locks performance in undercommit cases.
> (overcommit case is expected to perform well)
>
> But I am seeing hang in overcommit cases. Gdb showed that many vcpus
> are halted and there was no progress. Suspecting the problem /race with
> halting, I removed the halt() part of kvm_hibernate(). I am yet to
> take a closer look at the code on halt() related changes.

It seems like there may still be race conditions where the current code
is not handling correctly. I will look into that to see where the
problem is. BTW, what test do you use to produce the hang condition?

>
> Patch series with that change gave around 20% improvement for dbench
> 2x and 30% improvement for ebizzy 2x cases. (1x has no significant
> loss/gain).
>
>

What is the baseline for the performance improvement? Is it without the
unfair lock and PV qspinlock?

-Longman

2014-04-07 17:00:14

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

On 04/07/2014 10:09 AM, Peter Zijlstra wrote:
> On Fri, Apr 04, 2014 at 01:08:16PM -0400, Waiman Long wrote:
>> Peter's patch is a rewrite of my patches 1-4, there is no PV or unfair lock
>> support in there.
> Yes, because your patches were unreadable and entirely non obvious.
>
> And while I appreciate that its not entirely your fault; the subject is
> hard, you didn't even try to make it better and explain things in a
> normal gradual fashion.
>
> So what I did was start with a 'simple' correct implementation (although
> I could have started simpler still I suppose and added 3-4 more patches)
> and then added each optimization on top and explained the what and why
> for them.
>
> The result is similar code, but the path is ever so much easier to
> understand and review.

I appreciate your time in rewriting the code to make it easier to
review. I will based my next patch on your rewrite. However, I am going
to make the following minor changes:

1. I would like to series to be compilable with tip/master and v3.15-rc.
Your patches seem to use some constants like __BYTE_ORDER__ that are not
defined there. I will make change that to make them compilable on top of
the latest upstream code.

2. Your patch uses atomic_test_and_set_bit to set the pending bit. I
would prefer to use atomic_cmpxchg() which will ensure that the pending
bit won't be set when some other tasks are already on the queue. This
will also enable me to allow the simple setting of the lock bit without
using atomic instruction. This change doesn't change performance that
much when I tested it on Westmere. But on IvyBridge, it had a pretty big
performance impact.

3. I doubt it is a good idea for the queue head to use atomic_cmpxchg()
to pound on the lock cacheline repeatedly in the waiting loop. It will
make it much harder for the lock holder to access the lock cacheline. I
will use a simple atomic_read to query the status of the lock before
doing any write on it.


>
> And no, it doesn't have PV support; I spend the whole week trying to
> reverse engineer your patch 1; whereas if you'd presented it in the form
> I posted I might have agreed in a day or so.
>
> I still have to look at the PV bits; but seeing how you have shown no
> interest in writing coherent and understandable patch sets I'm less
> inclined to go stare at them for another few weeks and rewrite that code
> too.
>
> Also; theres a few subtle but important differences between the patch
> sets. Your series only makes x86_64 use the qspinlocks; the very last we
> want is i386 and x86_64 to use different spinlock implementations; we
> want less differences between them, not more.

I have no problem of making qspinlock the default for all x86 code.

>
> You stuff a whole lot of code into arch/x86 for no reason what so ever.
>
> Prior to that; I also rewrote your benchmark thing; using jiffies for
> timing is just vile. Not to mention you require a full kernel build and
> reboot (which is just stupid slow on big machines) to test anything.

Yes, using jiffies isn't the best idea, it is just an easier way. BTW, I
don't need to reboot the machine to run my test, I only need to build
the .ko file and use insmod to insert into the running kernel. When it
is done, rmmod can be used to remove it. I do need to rebuild the kernel
when I make changes to the qspinlock patch.

-Longman

2014-04-07 17:46:56

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/07/2014 10:08 PM, Waiman Long wrote:
> On 04/07/2014 02:14 AM, Raghavendra K T wrote:
[...]
>> But I am seeing hang in overcommit cases. Gdb showed that many vcpus
>> are halted and there was no progress. Suspecting the problem /race with
>> halting, I removed the halt() part of kvm_hibernate(). I am yet to
>> take a closer look at the code on halt() related changes.
>
> It seems like there may still be race conditions where the current code
> is not handling correctly. I will look into that to see where the
> problem is. BTW, what test do you use to produce the hang condition?

Running ebizzy on 2 of the vms simultaneously (for sometime in repeated
loop) could reproduce it.

>> Patch series with that change gave around 20% improvement for dbench
>> 2x and 30% improvement for ebizzy 2x cases. (1x has no significant
>> loss/gain).
>>

While at it, Just a correction it was 30% for ebizzy1.5x and around
80% for ebizzy 2x.

> What is the baseline for the performance improvement? Is it without the
> unfair lock and PV qspinlock?

Baseline was 3.14-rc8 without any of the qspin patch series.

2014-04-08 19:16:10

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/07/2014 01:51 PM, Raghavendra K T wrote:
> On 04/07/2014 10:08 PM, Waiman Long wrote:
>> On 04/07/2014 02:14 AM, Raghavendra K T wrote:
> [...]
>>> But I am seeing hang in overcommit cases. Gdb showed that many vcpus
>>> are halted and there was no progress. Suspecting the problem /race with
>>> halting, I removed the halt() part of kvm_hibernate(). I am yet to
>>> take a closer look at the code on halt() related changes.
>>
>> It seems like there may still be race conditions where the current code
>> is not handling correctly. I will look into that to see where the
>> problem is. BTW, what test do you use to produce the hang condition?
>
> Running ebizzy on 2 of the vms simultaneously (for sometime in
> repeated loop) could reproduce it.
>

Yes, I am able to reproduce the hang problem with ebizzy. BTW, could you
try to apply the attached patch file on top of the v8 patch series to
see if it can fix the hang problem?

>
>> What is the baseline for the performance improvement? Is it without the
>> unfair lock and PV qspinlock?
>
> Baseline was 3.14-rc8 without any of the qspin patch series.
>

Does the baseline have PV ticketlock or without any PV support?

-Longman


Attachments:
0011 (4.61 kB)

2014-04-09 12:03:25

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

On 04/09/2014 12:45 AM, Waiman Long wrote:
> Yes, I am able to reproduce the hang problem with ebizzy. BTW, could you
> try to apply the attached patch file on top of the v8 patch series to
> see if it can fix the hang problem?

Ran the benchmarks with the fix and I am not seeing hang so far.
ebizzy improvements.

0.5x 2.7345
1x -10.6593
1.5x 35.6962
2x 88.0461

dbench improvements
0.5x 3.2428
1x 1.1514
1.5x 5.5071
2x 23.8700
( I will have to restest on ebizzy 1x reression but overall performance
numbers look good).

>> Baseline was 3.14-rc8 without any of the qspin patch series.
>>
>
> Does the baseline have PV ticketlock or without any PV support?

baseline had PV ticketlock enabled.