2014-02-17 20:42:03

by Waiman Long

Subject: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

v3->v4:
- Remove debugging code and fix a configuration error
- Simplify the qspinlock structure and streamline the code to make it
perform a bit better
- Add an x86 version of asm/qspinlock.h for holding x86 specific
optimizations.
- Add an optimized x86 code path for 2 contending tasks to improve
low contention performance.

v2->v3:
- Simplify the code by using the numerous CPU mode only, without an unfair option.
- Use the latest smp_load_acquire()/smp_store_release() barriers.
- Move the queue spinlock code to kernel/locking.
- Make the use of queue spinlock the default for x86-64 without user
configuration.
- Additional performance tuning.

v1->v2:
- Add some more comments to document what the code does.
- Add a numerous CPU mode to support >= 16K CPUs
- Add a configuration option to allow lock stealing which can further
improve performance in many cases.
- Enable wakeup of queue head CPU at unlock time for non-numerous
CPU mode.

This patch set introduces a queue-based spinlock implementation that
can replace the default ticket spinlock without increasing the size
of the spinlock data structure. As a result, critical kernel data
structures that embed a spinlock won't increase in size or have their
data alignment broken.

The queue spinlock has slightly better performance than the ticket
spinlock in the uncontended case. Its performance can be much better
with moderate to heavy contention. This patch has the potential to
improve the performance of all workloads that have moderate to heavy
spinlock contention.

The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets, though a noticeable performance benefit probably won't
show up on machines with fewer than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock
contention problem. Those need to be solved by refactoring the code
to make more efficient use of the lock or by using finer-grained locks.
The main purpose is to make lock contention problems more tolerable
until someone can spend the time and effort to fix them.

Waiman Long (3):
qspinlock: Introducing a 4-byte queue spinlock implementation
qspinlock, x86: Enable x86-64 to use queue spinlock
qspinlock, x86: Add x86 specific optimization for 2 contending tasks

arch/x86/Kconfig | 1 +
arch/x86/include/asm/qspinlock.h | 236 ++++++++++++++++++++++
arch/x86/include/asm/spinlock.h | 5 +
arch/x86/include/asm/spinlock_types.h | 4 +
include/asm-generic/qspinlock.h | 122 ++++++++++++
include/asm-generic/qspinlock_types.h | 55 +++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qspinlock.c | 353 +++++++++++++++++++++++++++++++++
9 files changed, 784 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/qspinlock.h
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c


2014-02-17 20:42:09

by Waiman Long

Subject: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. Compared with
the ticket spinlock, this queue spinlock should be almost as fair. It
has about the same speed in the single-thread case and can be much
faster in high contention situations. Only in light to moderate
contention, where the average queue depth is around 1-3, may this
queue spinlock be a bit slower due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
on one spinlock at any one time. Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit value. Together with
the 1-byte lock bit, this queue spinlock implementation only needs
4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7 : the locked byte
Bits 8-9 : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)
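
To make this encoding concrete, the small standalone userspace sketch
below packs a CPU number and a per-cpu node index into the upper 24 bits
of a 32-bit word and decodes them back. It is only an illustration, not
the kernel code; the DEMO_* macro names are invented for the example
(the patch itself uses _SET_QCODE(), GET_CPU_NR() and GET_QN_IDX(),
shown further down).

#include <stdint.h>
#include <stdio.h>

#define DEMO_LOCKED   0x01U                           /* bits 0-7 : locked byte */
#define DEMO_ENCODE(cpu, idx) \
        ((((uint32_t)(cpu) + 1) << 10) | ((uint32_t)(idx) << 8))
#define DEMO_CPU(q)   ((((q) >> 10) & 0x3fffff) - 1)  /* bits 10-31: cpu + 1 */
#define DEMO_IDX(q)   (((q) >> 8) & 3)                /* bits 8-9  : node index */

int main(void)
{
        /* CPU 5, per-cpu queue node index 2, lock byte set */
        uint32_t word = DEMO_ENCODE(5, 2) | DEMO_LOCKED;

        printf("word=0x%08x cpu=%u idx=%u locked=%u\n",
               word, DEMO_CPU(word), DEMO_IDX(word), word & 0xff);
        return 0;
}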

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, and a warning message will be printed.

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4GHz Westmere x86-64 CPU. The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

Lock Type Time (ns)
--------- ---------
Ticket spinlock 14.1
Queue spinlock 8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that the ticket spinlock does a read-modify-write (add)
instruction in unlock, whereas the queue spinlock only does a simple
write, which can be much faster on a pipelined CPU.
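
To illustrate that difference, here is a simplified, hypothetical
userspace sketch of the two unlock styles using GCC atomic builtins.
It is not the kernel's actual fastpath code (the x86 queue unlock added
later in this series is a plain byte store surrounded by compiler
barriers, and the generic version uses an atomic subtraction); it only
shows why one unlock is a read-modify-write while the other is a single
store:

#include <stdint.h>

struct ticket_lock { uint16_t head, tail; };
struct queue_lock  { uint8_t locked; uint8_t pad[3]; };

/* Ticket-style unlock: a locked read-modify-write (add) on the head. */
static inline void ticket_style_unlock(struct ticket_lock *l)
{
        __atomic_fetch_add(&l->head, 1, __ATOMIC_RELEASE);
}

/* Queue-style unlock: a single release store of the lock byte. */
static inline void queue_style_unlock(struct queue_lock *l)
{
        __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}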

The AIM7 benchmark was run on an 8-socket 80-core DL980 with Westmere
x86-64 CPUs, with an XFS filesystem on a ramdisk and HT off, to evaluate
the performance impact of this patch on a 3.13 kernel.

+------------+----------+-----------------+---------+
| Kernel | 3.13 JPM | 3.13 with | %Change |
| | | qspinlock patch | |
+------------+----------+-----------------+---------+
| 10-100 users |
+------------+----------+-----------------+---------+
|custom | 357459 | 363109 | +1.58% |
|dbase | 496847 | 498801 | +0.39% |
|disk | 2925312 | 2771387 | -5.26% |
|five_sec | 166612 | 169215 | +1.56% |
|fserver | 382129 | 383279 | +0.30% |
|high_systime| 16356 | 16380 | +0.15% |
|short | 4521978 | 4257363 | -5.85% |
+------------+----------+-----------------+---------+
| 200-1000 users |
+------------+----------+-----------------+---------+
|custom | 449070 | 447711 | -0.30% |
|dbase | 845029 | 853362 | +0.99% |
|disk | 2725249 | 4892907 | +79.54% |
|five_sec | 169410 | 170638 | +0.72% |
|fserver | 489662 | 491828 | +0.44% |
|high_systime| 142823 | 143790 | +0.68% |
|short | 7435288 | 9016171 | +21.26% |
+------------+----------+-----------------+---------+
| 1100-2000 users |
+------------+----------+-----------------+---------+
|custom | 432470 | 432570 | +0.02% |
|dbase | 889289 | 890026 | +0.08% |
|disk | 2565138 | 5008732 | +95.26% |
|five_sec | 169141 | 170034 | +0.53% |
|fserver | 498569 | 500701 | +0.43% |
|high_systime| 229913 | 245866 | +6.94% |
|short | 8496794 | 8281918 | -2.53% |
+------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

26.19% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--47.28%-- evict
|--46.87%-- inode_sb_list_add
|--1.24%-- xlog_cil_insert_items
|--0.68%-- __remove_inode_hash
|--0.67%-- inode_wait_for_writeback
--3.26%-- [...]
22.96% swapper [kernel.kallsyms] [k] cpu_idle_loop
5.56% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
4.87% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
2.04% reaim [kernel.kallsyms] [k] mspin_lock
1.30% reaim [kernel.kallsyms] [k] memcpy
1.08% reaim [unknown] [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

26.82% swapper [kernel.kallsyms] [k] cpu_idle_loop
4.66% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
3.97% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
2.40% reaim [kernel.kallsyms] [k] queue_spin_lock_slowpath
|--88.31%-- _raw_spin_lock
| |--36.02%-- inode_sb_list_add
| |--35.09%-- evict
| |--16.89%-- xlog_cil_insert_items
| |--6.30%-- try_to_wake_up
| |--2.20%-- _xfs_buf_find
| |--0.75%-- __remove_inode_hash
| |--0.72%-- __mutex_lock_slowpath
| |--0.53%-- load_balance
|--6.02%-- _raw_spin_lock_irqsave
| |--74.75%-- down_trylock
| |--9.69%-- rcu_check_quiescent_state
| |--7.47%-- down
| |--3.57%-- up
| |--1.67%-- rwsem_wake
| |--1.00%-- remove_wait_queue
| |--0.56%-- pagevec_lru_move_fn
|--5.39%-- _raw_spin_lock_irq
| |--82.05%-- rwsem_down_read_failed
| |--10.48%-- rwsem_down_write_failed
| |--4.24%-- __down
| |--2.74%-- __schedule
--0.28%-- [...]
2.20% reaim [kernel.kallsyms] [k] memcpy
1.84% reaim [unknown] [.] 0x000000000041517b
1.77% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--21.08%-- xlog_cil_insert_items
|--10.14%-- xfs_icsb_modify_counters
|--7.20%-- xfs_iget_cache_hit
|--6.56%-- inode_sb_list_add
|--5.49%-- _xfs_buf_find
|--5.25%-- evict
|--5.03%-- __remove_inode_hash
|--4.64%-- __mutex_lock_slowpath
|--3.78%-- selinux_inode_free_security
|--2.95%-- xfs_inode_is_filestream
|--2.35%-- try_to_wake_up
|--2.07%-- xfs_inode_set_reclaim_tag
|--1.52%-- list_lru_add
|--1.16%-- xfs_inode_clear_eofblocks_tag
:
1.30% reaim [kernel.kallsyms] [k] effective_load
1.27% reaim [kernel.kallsyms] [k] mspin_lock
1.10% reaim [kernel.kallsyms] [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/asm-generic/qspinlock.h | 122 ++++++++++++
include/asm-generic/qspinlock_types.h | 43 ++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qspinlock.c | 353 +++++++++++++++++++++++++++++++++
5 files changed, 526 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..d2196ed
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return _QLOCK(lock);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !_QLOCK(&lock);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return _QCODE(lock);
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->qlcode) &&
+ (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ int qsval;
+
+ /*
+ * To reduce memory access to only once for the cold cache case,
+ * a direct cmpxchg() is performed in the fastpath to optimize the
+ * uncontended case. The contended performance, however, may suffer
+ * a bit because of that.
+ */
+ qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+ if (likely(qsval == 0))
+ return;
+ queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * Use an atomic subtraction to clear the lock bit.
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..e8cc9ca
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,43 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * The bits assignment are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+ atomic_t qlcode; /* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET 8
+#define _QSPINLOCK_LOCKED 1U
+#define _QCODE(lock) (atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
+#define _QLOCK(lock) (atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..df71bfa 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..1a17380 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..fab9819
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,353 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+ u32 wait; /* Waiting flag */
+ struct qnode *next; /* Next queue node addr */
+};
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1 (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1 (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number, we need to add
+ * 1 to the CPU number before putting it into the queue code.
+ */
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET _QCODE_OFFSET
+#endif
+#define MAX_QNODES 4
+#define GET_QN_IDX(code) (((code) >> _QCODE_VAL_OFFSET) & 3)
+#define GET_CPU_NR(code) (((code) >> (_QCODE_VAL_OFFSET + 2)) - 1)
+#ifndef _SET_QCODE
+#define _SET_QCODE(cpu, idx) ((((cpu) + 1) << (_QCODE_VAL_OFFSET + 2)) |\
+ ((idx) << _QCODE_VAL_OFFSET) |\
+ _QSPINLOCK_LOCKED)
+#endif
+
+struct qnode_set {
+ int node_idx; /* Current node to use */
+ struct qnode nodes[MAX_QNODES];
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
+ = { 0 };
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+ u32 cpu_nr = GET_CPU_NR(qcode);
+ u8 qn_idx = GET_QN_IDX(qcode);
+
+ return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+#ifndef queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+ int qlcode = atomic_read(&lock->qlcode);
+
+ if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+ qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+ return 1;
+ return 0;
+}
+#endif /* queue_spin_trylock_unfair */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+ int qlcode = atomic_read(&lock->qlcode);
+
+ *qcode = qlcode;
+ return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ * no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+ struct qnode_set *qset = this_cpu_ptr(&qnset);
+ int i;
+
+ if (unlikely(qset->node_idx >= MAX_QNODES))
+ return NULL;
+ i = qset->node_idx++;
+ *qn_idx = i;
+ return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+ struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+ qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+ unsigned int cpu_nr, qn_idx;
+ struct qnode *node, *next;
+ u32 prev_qcode, my_qcode;
+
+#ifdef queue_spin_trylock_quick
+ /*
+ * Try the quick spinning code path
+ */
+ if (queue_spin_trylock_quick(lock, qsval))
+ return;
+#endif
+ /*
+ * Get the queue node
+ */
+ cpu_nr = smp_processor_id();
+ node = get_qnode(&qn_idx);
+
+ if (unlikely(!node)) {
+ /*
+ * This shouldn't happen: print a warning message
+ * and busy spin on the lock.
+ */
+ printk_sched(
+ "qspinlock: queue node table exhausted at cpu %d!\n",
+ cpu_nr);
+ while (!queue_spin_trylock_unfair(lock))
+ arch_mutex_cpu_relax();
+ return;
+ }
+
+ /*
+ * Set up the new cpu code to be exchanged
+ */
+ my_qcode = _SET_QCODE(cpu_nr, qn_idx);
+
+ /*
+ * Initialize the queue node
+ */
+ node->wait = true;
+ node->next = NULL;
+
+ /*
+ * The lock may be available at this point, try again if no task was
+ * waiting in the queue.
+ */
+ if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+ put_qnode();
+ return;
+ }
+
+#ifdef queue_code_xchg
+ prev_qcode = queue_code_xchg(lock, my_qcode);
+#else
+ /*
+ * Exchange current copy of the queue node code
+ */
+ prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+ /*
+ * It is possible that we may accidentally steal the lock. If this is
+ * the case, we need to either release it if we are not the head of the
+ * queue, or take the lock and be done with it.
+ */
+ if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+ if (prev_qcode == 0) {
+ /*
+ * Got the lock since it is at the head of the queue
+ * Now try to atomically clear the queue code.
+ */
+ if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+ _QSPINLOCK_LOCKED) == my_qcode)
+ goto release_node;
+ /*
+ * The cmpxchg fails only if one or more tasks
+ * are added to the queue. In this case, we need to
+ * notify the next one to be the head of the queue.
+ */
+ goto notify_next;
+ }
+ /*
+ * We accidentally stole the lock; release it and
+ * let the queue head get it.
+ */
+ queue_spin_unlock(lock);
+ } else
+ prev_qcode &= ~_QSPINLOCK_LOCKED; /* Clear the lock bit */
+ my_qcode &= ~_QSPINLOCK_LOCKED;
+#endif /* queue_code_xchg */
+
+ if (prev_qcode) {
+ /*
+ * Not at the queue head, get the address of the previous node
+ * and set up the "next" field of that node.
+ */
+ struct qnode *prev = xlate_qcode(prev_qcode);
+
+ ACCESS_ONCE(prev->next) = node;
+ /*
+ * Wait until the waiting flag is off
+ */
+ while (smp_load_acquire(&node->wait))
+ arch_mutex_cpu_relax();
+ }
+
+ /*
+ * At the head of the wait queue now
+ */
+ while (true) {
+ u32 qcode;
+ int retval;
+
+ retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+ if (retval > 0)
+ ; /* Lock not available yet */
+ else if (retval < 0)
+ /* Lock taken, can release the node & return */
+ goto release_node;
+ else if (qcode != my_qcode) {
+ /*
+ * Just get the lock with other spinners waiting
+ * in the queue.
+ */
+ if (queue_spin_trylock_unfair(lock))
+ goto notify_next;
+ } else {
+ /*
+ * Get the lock & clear the queue code simultaneously
+ */
+ if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+ /* No need to notify the next one */
+ goto release_node;
+ }
+ arch_mutex_cpu_relax();
+ }
+
+notify_next:
+ /*
+ * Wait, if needed, until the next one in the queue sets up its next field
+ */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ /*
+ * The next one in queue is now at the head
+ */
+ smp_store_release(&next->wait, false);
+
+release_node:
+ put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
--
1.7.1

2014-02-17 20:42:16

by Waiman Long

Subject: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

A major problem with the queue spinlock patch is its performance at
low contention level (2-4 contending tasks), where it is slower than
the corresponding ticket spinlock code path. The following table shows
the execution time (in ms) of a micro-benchmark in which 5M iterations
of the lock/unlock cycle were run on a 10-core Westmere-EX CPU with
2 different types of loads - standalone (lock and protected data in
different cachelines) and embedded (lock and protected data in the
same cacheline).

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 135/111 135/102 0%/-8%
2 732/950 1315/1573 +80%/+66%
3 1827/1783 2372/2428 +30%/+36%
4 2689/2725 2934/2934 +9%/+8%
5 3736/3748 3658/3652 -2%/-3%
6 4942/4984 4434/4428 -10%/-11%
7 6304/6319 5176/5163 -18%/-18%
8 7736/7629 5955/5944 -23%/-22%

It can be seen that the performance degradation is particularly bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention level, a special x86-specific optimized code path
for 2 contending tasks was added. This special code path will only
be activated with fewer than 16K configured CPUs, because it uses
a byte in the 32-bit lock word to hold a waiting bit for the 2nd
contending task instead of queuing the waiting task in the queue.

With the change, the performance data became:

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
2 732/950 523/528 -29%/-44%
3 1827/1783 2366/2384 +30%/+34%

The queue spinlock code path is now a bit faster with 2 contending
tasks. There is also a very slight improvement with 3 contending
tasks.
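
For reference, a micro-benchmark of the general shape described above
(each task doing 5M lock/unlock iterations on a shared lock) can be
sketched as follows. This is only a hypothetical userspace
approximation: a pthread spinlock stands in for the kernel lock
implementations being compared, so its numbers are not comparable to
the tables above. It uses the "embedded" layout (lock and protected
counter in one cacheline); the "standalone" variant would simply move
the counter to its own cacheline, as in the second struct:

/* Build with: gcc -O2 -pthread qsbench.c -o qsbench (illustrative only) */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 5000000
#define NTASKS     2

/* Embedded layout: lock and protected data share a cacheline. */
struct embedded {
        pthread_spinlock_t lock;
        unsigned long counter;
} __attribute__((aligned(64)));

/* Standalone layout: lock and protected data in different cachelines. */
struct standalone {
        pthread_spinlock_t lock __attribute__((aligned(64)));
        unsigned long counter __attribute__((aligned(64)));
};

static struct embedded emb;

static void *worker(void *arg)
{
        long i;

        (void)arg;
        for (i = 0; i < ITERATIONS; i++) {
                pthread_spin_lock(&emb.lock);
                emb.counter++;
                pthread_spin_unlock(&emb.lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tasks[NTASKS];
        struct timespec t0, t1;
        int i;

        pthread_spin_init(&emb.lock, PTHREAD_PROCESS_PRIVATE);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NTASKS; i++)
                pthread_create(&tasks[i], NULL, worker, NULL);
        for (i = 0; i < NTASKS; i++)
                pthread_join(tasks[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%d tasks, %d lock/unlock iterations each: %ld ms\n",
               NTASKS, ITERATIONS,
               (long)((t1.tv_sec - t0.tv_sec) * 1000 +
                      (t1.tv_nsec - t0.tv_nsec) / 1000000));
        return 0;
}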

The performance of the optimized code path can vary depending on which
of the several different code paths is taken. It is also not as fair as
the ticket spinlock, and there is some variation in the execution times
of the 2 contending tasks. Testing with different pairs of cores within
the same CPU shows an execution time that varies from 400ms to 1194ms.
The ticket spinlock code also shows a variation of 718-1146ms, which
is probably due to the CPU topology within a socket.

In a multi-socket server, the optimized code path also seems to
produce a big performance improvement in cross-node contention traffic
at low contention level. The table below shows the performance with
1 contending task per node:

[Standalone]
# of nodes Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 135 135 0%
2 4452 528 -88%
3 10767 2369 -78%
4 20835 2921 -86%

The micro-benchmark was also run on a 4-core Ivy-Bridge PC. The table
below shows the collected performance data:

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 197/178 181/150 -8%/-16%
2 1109/928 435-1417/697-2125
3 1836/1702 1372-3112/1379-3138
4 2717/2429 1842-4158/1846-4170

The performance of the queue lock patch varied from run to run, whereas
the performance of the ticket lock was more consistent. The queue
lock figures above show the range of values that were reported.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock.h | 183 ++++++++++++++++++++++++++++++++-
include/asm-generic/qspinlock_types.h | 16 +++-
2 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7316ec4..ff205c4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,10 +7,18 @@

/*
* Queue spinlock union structure to be used within this header file only
+ * Besides the slock and lock fields, the other fields are only valid with
+ * less than 16K CPUs.
*/
union qspinlock_x86 {
struct qspinlock slock;
- u8 lock; /* Lock bit */
+ struct {
+ u8 lock; /* Lock bit */
+ u8 wait; /* Waiting bit */
+ u16 qcode; /* Queue code */
+ };
+ u16 lock_wait; /* Lock and wait bits */
+ int qlcode;
};

#define queue_spin_trylock_unfair queue_spin_trylock_unfair
@@ -48,6 +56,179 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
barrier();
}

+#ifndef _Q_MANY_CPUS
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * the x86 architecture:
+ * 1) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ * for a waiting lock acquirer so that it won't need to go through the
+ * MCS style locking queuing, which has a higher overhead.
+ * 2) The 16-bit queue code can be accessed or modified directly as a
+ * 16-bit short value without disturbing the first 2 bytes.
+ */
+#define _QSPINLOCK_WAITING 0x100U /* Waiting bit in 2nd byte */
+#define _QSPINLOCK_LWMASK 0xffff /* Mask for lock & wait bits */
+
+/*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET 0
+#define _SET_QCODE(cpu, idx) (((cpu) + 1) << 2 | (idx))
+
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - fast spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue. This
+ * optimized path is not as fair as the ticket spinlock, but it offers
+ * slightly better performance. The regular MCS locking path for 3 or
+ * more contending tasks, however, is fair.
+ *
+ * Depending on the exact timing, there are several different paths that
+ * a contending task can take. The actual contention performance depends
+ * on which path is taken. So it can be faster or slower than the
+ * corresponding ticket spinlock path. On average, it is probably on par
+ * with the ticket spinlock.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+ u16 old;
+
+ /*
+ * Fall into the quick spinning code path only if no one is waiting
+ * or the lock is available.
+ */
+ if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
+ (qsval != _QSPINLOCK_WAITING)))
+ return 0;
+
+ old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+ if (old == 0) {
+ /*
+ * Got the lock, can clear the waiting bit now
+ */
+ ACCESS_ONCE(qlock->wait) = 0;
+ return 1;
+ } else if (old == _QSPINLOCK_LOCKED) {
+try_again:
+ /*
+ * Wait until the lock byte is cleared to get the lock
+ */
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(qlock->lock));
+ /*
+ * Set the lock bit & clear the waiting bit
+ */
+ if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+ _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+ return 1;
+ /*
+ * Someone has stolen the lock, so wait again
+ */
+ goto try_again;
+ } else if (old == _QSPINLOCK_WAITING) {
+ /*
+ * Another task is already waiting while we steal the lock.
+ * A bit of unfairness here won't change the big picture.
+ * So just take the lock and return.
+ */
+ return 1;
+ }
+ /*
+ * Nothing needs to be done if the old value is
+ * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
+ */
+ return 0;
+}
+
+#define queue_code_xchg queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: New queue code to be exchanged
+ * Return: The original qcode value in the queue spinlock
+ */
+static inline u32 queue_code_xchg(struct qspinlock *lock, u32 qcode)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+ return (u32)xchg(&qlock->qcode, (u16)qcode);
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ qcode <<= _QCODE_OFFSET;
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+
+#define queue_get_lock_qcode queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value
+ * Return : > 0 if lock is not available
+ * = 0 if lock is free
+ * < 0 if lock is taken & can return after cleanup
+ *
+ * It is considered locked when either the lock bit or the wait bit is set.
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+ u32 qlcode;
+
+ qlcode = (u32)atomic_read(&lock->qlcode);
+ /*
+ * In the special case that qlcode contains only _QSPINLOCK_LOCKED
+ * and mycode, try to transition back to the quick spinning
+ * code by clearing the qcode and setting the _QSPINLOCK_WAITING
+ * bit.
+ */
+ if (qlcode == (_QSPINLOCK_LOCKED | (mycode << _QCODE_OFFSET))) {
+ u32 old = qlcode;
+
+ qlcode = atomic_cmpxchg(&lock->qlcode, old,
+ _QSPINLOCK_LOCKED|_QSPINLOCK_WAITING);
+ if (qlcode == old) {
+ union qspinlock_x86 *slock =
+ (union qspinlock_x86 *)lock;
+try_again:
+ /*
+ * Wait until the lock byte is cleared
+ */
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(slock->lock));
+ /*
+ * Set the lock bit & clear the waiting bit
+ */
+ if (cmpxchg(&slock->lock_wait, _QSPINLOCK_WAITING,
+ _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+ return -1; /* Got the lock */
+ goto try_again;
+ }
+ }
+ *qcode = qlcode >> _QCODE_OFFSET;
+ return qlcode & _QSPINLOCK_LWMASK;
+}
+
+#endif /* _Q_MANY_CPUS */
#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */

#include <asm-generic/qspinlock.h>
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index e8cc9ca..486d8fd 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -26,16 +26,28 @@
/*
* The queue spinlock data structure - a 32-bit word
*
- * The bits assignment are:
+ * For NR_CPUS >= 16K, the bits assignment are:
* Bit 0 : Set if locked
* Bits 1-7 : Not used
* Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bits assignment are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-15: Reserved for architecture specific optimization
+ * Bits 16-31: Queue code
*/
typedef struct qspinlock {
atomic_t qlcode; /* Lock + queue code */
} arch_spinlock_t;

-#define _QCODE_OFFSET 8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _Q_MANY_CPUS
+# define _QCODE_OFFSET 8
+#else
+# define _QCODE_OFFSET 16
+#endif
+
#define _QSPINLOCK_LOCKED 1U
#define _QCODE(lock) (atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
#define _QLOCK(lock) (atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
--
1.7.1

2014-02-17 20:42:51

by Waiman Long

Subject: [PATCH v4 2/3] qspinlock, x86: Enable x86-64 to use queue spinlock

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent there, so it is not enabled for x86-32.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimizations that make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/qspinlock.h | 55 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/spinlock.h | 5 +++
arch/x86/include/asm/spinlock_types.h | 4 ++
4 files changed, 65 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1b4ff87..5bf70ab 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+ select ARCH_USE_QUEUE_SPINLOCK

### Arch settings
config X86
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..7316ec4
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,55 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+/*
+ * Queue spinlock union structure to be used within this header file only
+ */
+union qspinlock_x86 {
+ struct qspinlock slock;
+ u8 lock; /* Lock bit */
+};
+
+#define queue_spin_trylock_unfair queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+ if (!ACCESS_ONCE(qlock->lock) &&
+ (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+ barrier();
+ ACCESS_ONCE(qlock->lock) = 0;
+ barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..6e6de1f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,10 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS

static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +185,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
{
arch_spin_lock(lock);
}
+#endif /* CONFIG_QUEUE_SPINLOCK */

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..98da00b 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -11,6 +11,9 @@
#define TICKET_SLOWPATH_FLAG ((__ticket_t)0)
#endif

+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
typedef u8 __ticket_t;
typedef u16 __ticketpair_t;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
} arch_spinlock_t;

#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */

#include <asm/rwlock.h>

--
1.7.1

Subject: [tip:x86/spinlocks] qspinlock, x86: Enable x86-64 to use queue spinlock

Commit-ID: 8642749c6ea4a3437ac733341f0c0b83e194b2d6
Gitweb: http://git.kernel.org/tip/8642749c6ea4a3437ac733341f0c0b83e194b2d6
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 17 Feb 2014 15:41:23 -0500
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 17 Feb 2014 13:44:39 -0800

qspinlock, x86: Enable x86-64 to use queue spinlock

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent there, so it is not enabled for x86-32.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimizations that make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/qspinlock.h | 55 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/spinlock.h | 5 ++++
arch/x86/include/asm/spinlock_types.h | 4 +++
4 files changed, 65 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..de573f9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
depends on 64BIT
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
+ select ARCH_USE_QUEUE_SPINLOCK

### Arch settings
config X86
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..7316ec4
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,55 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+/*
+ * Queue spinlock union structure to be used within this header file only
+ */
+union qspinlock_x86 {
+ struct qspinlock slock;
+ u8 lock; /* Lock bit */
+};
+
+#define queue_spin_trylock_unfair queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+ if (!ACCESS_ONCE(qlock->lock) &&
+ (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+ barrier();
+ ACCESS_ONCE(qlock->lock) = 0;
+ barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..6e6de1f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,10 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS

static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +185,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
{
arch_spin_lock(lock);
}
+#endif /* CONFIG_QUEUE_SPINLOCK */

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..98da00b 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -11,6 +11,9 @@
#define TICKET_SLOWPATH_FLAG ((__ticket_t)0)
#endif

+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
typedef u8 __ticket_t;
typedef u16 __ticketpair_t;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
} arch_spinlock_t;

#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */

#include <asm/rwlock.h>

Subject: [tip:x86/spinlocks] qspinlock: Introducing a 4-byte queue spinlock implementation

Commit-ID: 76a43f3688b67e925d3afa670330e4e05ef2bb1c
Gitweb: http://git.kernel.org/tip/76a43f3688b67e925d3afa670330e4e05ef2bb1c
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 17 Feb 2014 15:41:22 -0500
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 17 Feb 2014 13:44:33 -0800

qspinlock: Introducing a 4-byte queue spinlock implementation

This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. Compared with
the ticket spinlock, this queue spinlock should be almost as fair. It
has about the same speed in the single-thread case and can be much
faster in high contention situations. Only in light to moderate
contention, where the average queue depth is around 1-3, may this
queue spinlock be a bit slower due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
on one spinlock at any one time. Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit value. Together with
the 1-byte lock bit, this queue spinlock implementation only needs
4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7 : the locked byte
Bits 8-9 : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, and a warning message will be printed.

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4GHz Westmere x86-64 CPU. The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

Lock Type Time (ns)
--------- ---------
Ticket spinlock 14.1
Queue spinlock 8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that the ticket spinlock does a read-modify-write (add)
instruction in unlock, whereas the queue spinlock only does a simple
write, which can be much faster on a pipelined CPU.

The AIM7 benchmark was run on an 8-socket 80-core DL980 with Westmere
x86-64 CPUs, with an XFS filesystem on a ramdisk and HT off, to evaluate
the performance impact of this patch on a 3.13 kernel.

+------------+----------+-----------------+---------+
| Kernel | 3.13 JPM | 3.13 with | %Change |
| | | qspinlock patch | |
+------------+----------+-----------------+---------+
| 10-100 users |
+------------+----------+-----------------+---------+
|custom | 357459 | 363109 | +1.58% |
|dbase | 496847 | 498801 | +0.39% |
|disk | 2925312 | 2771387 | -5.26% |
|five_sec | 166612 | 169215 | +1.56% |
|fserver | 382129 | 383279 | +0.30% |
|high_systime| 16356 | 16380 | +0.15% |
|short | 4521978 | 4257363 | -5.85% |
+------------+----------+-----------------+---------+
| 200-1000 users |
+------------+----------+-----------------+---------+
|custom | 449070 | 447711 | -0.30% |
|dbase | 845029 | 853362 | +0.99% |
|disk | 2725249 | 4892907 | +79.54% |
|five_sec | 169410 | 170638 | +0.72% |
|fserver | 489662 | 491828 | +0.44% |
|high_systime| 142823 | 143790 | +0.68% |
|short | 7435288 | 9016171 | +21.26% |
+------------+----------+-----------------+---------+
| 1100-2000 users |
+------------+----------+-----------------+---------+
|custom | 432470 | 432570 | +0.02% |
|dbase | 889289 | 890026 | +0.08% |
|disk | 2565138 | 5008732 | +95.26% |
|five_sec | 169141 | 170034 | +0.53% |
|fserver | 498569 | 500701 | +0.43% |
|high_systime| 229913 | 245866 | +6.94% |
|short | 8496794 | 8281918 | -2.53% |
+------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

26.19% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--47.28%-- evict
|--46.87%-- inode_sb_list_add
|--1.24%-- xlog_cil_insert_items
|--0.68%-- __remove_inode_hash
|--0.67%-- inode_wait_for_writeback
--3.26%-- [...]
22.96% swapper [kernel.kallsyms] [k] cpu_idle_loop
5.56% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
4.87% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
2.04% reaim [kernel.kallsyms] [k] mspin_lock
1.30% reaim [kernel.kallsyms] [k] memcpy
1.08% reaim [unknown] [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

26.82% swapper [kernel.kallsyms] [k] cpu_idle_loop
4.66% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
3.97% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
2.40% reaim [kernel.kallsyms] [k] queue_spin_lock_slowpath
|--88.31%-- _raw_spin_lock
| |--36.02%-- inode_sb_list_add
| |--35.09%-- evict
| |--16.89%-- xlog_cil_insert_items
| |--6.30%-- try_to_wake_up
| |--2.20%-- _xfs_buf_find
| |--0.75%-- __remove_inode_hash
| |--0.72%-- __mutex_lock_slowpath
| |--0.53%-- load_balance
|--6.02%-- _raw_spin_lock_irqsave
| |--74.75%-- down_trylock
| |--9.69%-- rcu_check_quiescent_state
| |--7.47%-- down
| |--3.57%-- up
| |--1.67%-- rwsem_wake
| |--1.00%-- remove_wait_queue
| |--0.56%-- pagevec_lru_move_fn
|--5.39%-- _raw_spin_lock_irq
| |--82.05%-- rwsem_down_read_failed
| |--10.48%-- rwsem_down_write_failed
| |--4.24%-- __down
| |--2.74%-- __schedule
--0.28%-- [...]
2.20% reaim [kernel.kallsyms] [k] memcpy
1.84% reaim [unknown] [.] 0x000000000041517b
1.77% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--21.08%-- xlog_cil_insert_items
|--10.14%-- xfs_icsb_modify_counters
|--7.20%-- xfs_iget_cache_hit
|--6.56%-- inode_sb_list_add
|--5.49%-- _xfs_buf_find
|--5.25%-- evict
|--5.03%-- __remove_inode_hash
|--4.64%-- __mutex_lock_slowpath
|--3.78%-- selinux_inode_free_security
|--2.95%-- xfs_inode_is_filestream
|--2.35%-- try_to_wake_up
|--2.07%-- xfs_inode_set_reclaim_tag
|--1.52%-- list_lru_add
|--1.16%-- xfs_inode_clear_eofblocks_tag
:
1.30% reaim [kernel.kallsyms] [k] effective_load
1.27% reaim [kernel.kallsyms] [k] mspin_lock
1.10% reaim [kernel.kallsyms] [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
include/asm-generic/qspinlock.h | 122 ++++++++++++
include/asm-generic/qspinlock_types.h | 43 +++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qspinlock.c | 353 ++++++++++++++++++++++++++++++++++
5 files changed, 526 insertions(+)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..d2196ed
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return _QLOCK(lock);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !_QLOCK(&lock);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return _QCODE(lock);
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->qlcode) &&
+ (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ int qsval;
+
+ /*
+ * To reduce memory accesses to just one in the cold cache case,
+ * a direct cmpxchg() is performed in the fastpath to optimize the
+ * uncontended case. The contended performance, however, may suffer
+ * a bit because of that.
+ */
+ qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+ if (likely(qsval == 0))
+ return;
+ queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * Use an atomic subtraction to clear the lock bit.
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..e8cc9ca
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,43 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * The bits assignment are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+ atomic_t qlcode; /* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET 8
+#define _QSPINLOCK_LOCKED 1U
+#define _QCODE(lock) (atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
+#define _QLOCK(lock) (atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..df71bfa 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..1a17380 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..fab9819
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,353 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+ u32 wait; /* Waiting flag */
+ struct qnode *next; /* Next queue node addr */
+};
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1 (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1 (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * Since the value 0 cannot be used as a valid CPU number, we need to
+ * add 1 to the CPU number before putting it into the queue code.
+ */
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET _QCODE_OFFSET
+#endif
+#define MAX_QNODES 4
+#define GET_QN_IDX(code) (((code) >> _QCODE_VAL_OFFSET) & 3)
+#define GET_CPU_NR(code) (((code) >> (_QCODE_VAL_OFFSET + 2)) - 1)
+#ifndef _SET_QCODE
+#define _SET_QCODE(cpu, idx) ((((cpu) + 1) << (_QCODE_VAL_OFFSET + 2)) |\
+ ((idx) << _QCODE_VAL_OFFSET) |\
+ _QSPINLOCK_LOCKED)
+#endif
+
+struct qnode_set {
+ int node_idx; /* Current node to use */
+ struct qnode nodes[MAX_QNODES];
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
+ = { 0 };
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+ u32 cpu_nr = GET_CPU_NR(qcode);
+ u8 qn_idx = GET_QN_IDX(qcode);
+
+ return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+#ifndef queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+ int qlcode = atomic_read(&lock->qlcode);
+
+ if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+ qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+ return 1;
+ return 0;
+}
+#endif /* queue_spin_trylock_unfair */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+ int qlcode = atomic_read(&lock->qlcode);
+
+ *qcode = qlcode;
+ return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ * no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+ struct qnode_set *qset = this_cpu_ptr(&qnset);
+ int i;
+
+ if (unlikely(qset->node_idx >= MAX_QNODES))
+ return NULL;
+ i = qset->node_idx++;
+ *qn_idx = i;
+ return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+ struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+ qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+ unsigned int cpu_nr, qn_idx;
+ struct qnode *node, *next;
+ u32 prev_qcode, my_qcode;
+
+#ifdef queue_spin_trylock_quick
+ /*
+ * Try the quick spinning code path
+ */
+ if (queue_spin_trylock_quick(lock, qsval))
+ return;
+#endif
+ /*
+ * Get the queue node
+ */
+ cpu_nr = smp_processor_id();
+ node = get_qnode(&qn_idx);
+
+ if (unlikely(!node)) {
+ /*
+ * This shouldn't happen, print a warning message
+ * & busy spinning on the lock.
+ */
+ printk_sched(
+ "qspinlock: queue node table exhausted at cpu %d!\n",
+ cpu_nr);
+ while (!queue_spin_trylock_unfair(lock))
+ arch_mutex_cpu_relax();
+ return;
+ }
+
+ /*
+ * Set up the new cpu code to be exchanged
+ */
+ my_qcode = _SET_QCODE(cpu_nr, qn_idx);
+
+ /*
+ * Initialize the queue node
+ */
+ node->wait = true;
+ node->next = NULL;
+
+ /*
+ * The lock may be available at this point, try again if no task was
+ * waiting in the queue.
+ */
+ if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+ put_qnode();
+ return;
+ }
+
+#ifdef queue_code_xchg
+ prev_qcode = queue_code_xchg(lock, my_qcode);
+#else
+ /*
+ * Exchange current copy of the queue node code
+ */
+ prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+ /*
+ * It is possible that we may accidentally steal the lock. If this is
+ * the case, we need to either release it if not the head of the queue
+ * or get the lock and be done with it.
+ */
+ if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+ if (prev_qcode == 0) {
+ /*
+ * Got the lock since it is at the head of the queue
+ * Now try to atomically clear the queue code.
+ */
+ if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+ _QSPINLOCK_LOCKED) == my_qcode)
+ goto release_node;
+ /*
+ * The cmpxchg fails only if one or more tasks
+ * are added to the queue. In this case, we need to
+ * notify the next one to be the head of the queue.
+ */
+ goto notify_next;
+ }
+ /*
+ * Accidentally steal the lock, release the lock and
+ * let the queue head get it.
+ */
+ queue_spin_unlock(lock);
+ } else
+ prev_qcode &= ~_QSPINLOCK_LOCKED; /* Clear the lock bit */
+ my_qcode &= ~_QSPINLOCK_LOCKED;
+#endif /* queue_code_xchg */
+
+ if (prev_qcode) {
+ /*
+ * Not at the queue head, get the address of the previous node
+ * and set up the "next" fields of the that node.
+ */
+ struct qnode *prev = xlate_qcode(prev_qcode);
+
+ ACCESS_ONCE(prev->next) = node;
+ /*
+ * Wait until the waiting flag is off
+ */
+ while (smp_load_acquire(&node->wait))
+ arch_mutex_cpu_relax();
+ }
+
+ /*
+ * At the head of the wait queue now
+ */
+ while (true) {
+ u32 qcode;
+ int retval;
+
+ retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+ if (retval > 0)
+ ; /* Lock not available yet */
+ else if (retval < 0)
+ /* Lock taken, can release the node & return */
+ goto release_node;
+ else if (qcode != my_qcode) {
+ /*
+ * Just get the lock with other spinners waiting
+ * in the queue.
+ */
+ if (queue_spin_trylock_unfair(lock))
+ goto notify_next;
+ } else {
+ /*
+ * Get the lock & clear the queue code simultaneously
+ */
+ if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+ /* No need to notify the next one */
+ goto release_node;
+ }
+ arch_mutex_cpu_relax();
+ }
+
+notify_next:
+ /*
+ * Wait, if needed, until the next one in queue set up the next field
+ */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+ /*
+ * The next one in queue is now at the head
+ */
+ smp_store_release(&next->wait, false);
+
+release_node:
+ put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
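
As a side note, the CPU-number/index packing that the generic slowpath
above relies on can be exercised with a small stand-alone program. This
is only an illustration of the generic _QCODE_VAL_OFFSET == 8 layout;
the helper names (set_qcode, get_cpu_nr, get_qn_idx) are local to the
sketch and are not the kernel macros themselves:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define QCODE_OFFSET	8	/* generic layout: qcode in bits 8-31 */
#define LOCKED_BIT	1U	/* bit 0 of the lock word */

/* Pack "CPU number + 1" and the per-CPU node index, plus the lock bit */
static uint32_t set_qcode(unsigned int cpu, unsigned int idx)
{
	return ((cpu + 1) << (QCODE_OFFSET + 2)) |
	       (idx << QCODE_OFFSET) | LOCKED_BIT;
}

/* Recover the CPU number (stored off by one so that 0 means "no waiter") */
static unsigned int get_cpu_nr(uint32_t code)
{
	return (code >> (QCODE_OFFSET + 2)) - 1;
}

/* Recover the 2-bit nesting index into the per-CPU node array */
static unsigned int get_qn_idx(uint32_t code)
{
	return (code >> QCODE_OFFSET) & 3;
}

int main(void)
{
	uint32_t code = set_qcode(5, 2);	/* CPU 5, nesting level 2 */

	assert(get_cpu_nr(code) == 5);
	assert(get_qn_idx(code) == 2);
	printf("qcode for cpu 5, idx 2: 0x%x\n", (unsigned int)code);
	return 0;
}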

Subject: [tip:x86/spinlocks] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Commit-ID: 255f74f845af196c3f45297d07d1ac7a219bf949
Gitweb: http://git.kernel.org/tip/255f74f845af196c3f45297d07d1ac7a219bf949
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 17 Feb 2014 15:41:24 -0500
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 17 Feb 2014 13:44:43 -0800

qspinlock, x86: Add x86 specific optimization for 2 contending tasks

A major problem with the queue spinlock patch is its performance at
low contention levels (2-4 contending tasks), where it is slower than
the corresponding ticket spinlock code path. The following table shows
the execution time (in ms) of a micro-benchmark where 5M iterations
of the lock/unlock cycles were run on a 10-core Westmere-EX CPU with
two different types of loads - standalone (lock and protected data in
different cachelines) and embedded (lock and protected data in the
same cacheline).

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 135/111 135/102 0%/-8%
2 732/950 1315/1573 +80%/+66%
3 1827/1783 2372/2428 +30%/+36%
4 2689/2725 2934/2934 +9%/+8%
5 3736/3748 3658/3652 -2%/-3%
6 4942/4984 4434/4428 -10%/-11%
7 6304/6319 5176/5163 -18%/-18%
8 7736/7629 5955/5944 -23%/-22%

It can be seen that the performance degradation is particularly bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention levels, a special x86-specific optimized code path
for 2 contending tasks was added. This special code path is only
activated when fewer than 16K CPUs are configured, because it uses
a byte in the 32-bit lock word to hold a waiting bit for the 2nd
contending task instead of queuing the waiting task in the queue.

With the change, the performance data became:

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
2 732/950 523/528 -29%/-44%
3 1827/1783 2366/2384 +30%/+34%

The queue spinlock code path is now a bit faster with 2 contending
tasks. There is also a very slight improvement with 3 contending
tasks.

The performance of the optimized code path can vary depending on which
of the several different code paths is taken. It is also not as fair as
the ticket spinlock, and there is some variation in the execution times
of the 2 contending tasks. Testing with different pairs of cores within
the same CPU shows an execution time that varies from 400ms to 1194ms.
The ticket spinlock code also shows a variation of 718-1146ms, which
is probably due to the CPU topology within a socket.

In a multi-socket server, the optimized code path also seems to
produce a big performance improvement in cross-node contention traffic
at low contention levels. The table below shows the performance with
1 contending task per node:

[Standalone]
# of nodes Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 135 135 0%
2 4452 528 -88%
3 10767 2369 -78%
4 20835 2921 -86%

The micro-benchmark was also run on a 4-core Ivy-Bridge PC. The table
below shows the collected performance data:

[Standalone/Embedded]
# of tasks Ticket lock Queue lock %Change
---------- ----------- ---------- -------
1 197/178 181/150 -8%/-16%
2 1109/928 435-1417/697-2125
3 1836/1702 1372-3112/1379-3138
4 2717/2429 1842-4158/1846-4170

The performance of the queue lock patch varied from run to run, whereas
the performance of the ticket lock was more consistent. The queue
lock figures above are the ranges of values that were reported.

Signed-off-by: Waiman Long <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/qspinlock.h | 183 +++++++++++++++++++++++++++++++++-
include/asm-generic/qspinlock_types.h | 16 ++-
2 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7316ec4..ff205c4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,10 +7,18 @@

/*
* Queue spinlock union structure to be used within this header file only
+ * Besides the slock and lock fields, the other fields are only valid with
+ * less than 16K CPUs.
*/
union qspinlock_x86 {
struct qspinlock slock;
- u8 lock; /* Lock bit */
+ struct {
+ u8 lock; /* Lock bit */
+ u8 wait; /* Waiting bit */
+ u16 qcode; /* Queue code */
+ };
+ u16 lock_wait; /* Lock and wait bits */
+ int qlcode;
};

#define queue_spin_trylock_unfair queue_spin_trylock_unfair
@@ -48,6 +56,179 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
barrier();
}

+#ifndef _Q_MANY_CPUS
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * the x86 architecture:
+ * 1) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ * for waiting lock acquirer so that it won't need to go through the
+ * MCS style locking queuing which has a higher overhead.
+ * 2) The 16-bit queue code can be accessed or modified directly as a
+ * 16-bit short value without disturbing the first 2 bytes.
+ */
+#define _QSPINLOCK_WAITING 0x100U /* Waiting bit in 2nd byte */
+#define _QSPINLOCK_LWMASK 0xffff /* Mask for lock & wait bits */
+
+/*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET 0
+#define _SET_QCODE(cpu, idx) (((cpu) + 1) << 2 | (idx))
+
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - fast spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue. This
+ * optimized path is not as fair as the ticket spinlock, but it offers
+ * slightly better performance. The regular MCS locking path for 3 or
+ * more contending tasks, however, is fair.
+ *
+ * Depending on the exact timing, there are several different paths that
+ * a contending task can take. The actual contention performance depends
+ * on which path is taken. So it can be faster or slower than the
+ * corresponding ticket spinlock path. On average, it is probably on par
+ * with ticket spinlock.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+ u16 old;
+
+ /*
+ * Fall into the quick spinning code path only if no one is waiting
+ * or the lock is available.
+ */
+ if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
+ (qsval != _QSPINLOCK_WAITING)))
+ return 0;
+
+ old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+ if (old == 0) {
+ /*
+ * Got the lock, can clear the waiting bit now
+ */
+ ACCESS_ONCE(qlock->wait) = 0;
+ return 1;
+ } else if (old == _QSPINLOCK_LOCKED) {
+try_again:
+ /*
+ * Wait until the lock byte is cleared to get the lock
+ */
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(qlock->lock));
+ /*
+ * Set the lock bit & clear the waiting bit
+ */
+ if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+ _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+ return 1;
+ /*
+ * Someone has stolen the lock, so wait again
+ */
+ goto try_again;
+ } else if (old == _QSPINLOCK_WAITING) {
+ /*
+ * Another task is already waiting while we steal the lock.
+ * A bit of unfairness here won't change the big picture.
+ * So just take the lock and return.
+ */
+ return 1;
+ }
+ /*
+ * Nothing needs to be done if the old value is
+ * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
+ */
+ return 0;
+}
+
+#define queue_code_xchg queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: New queue code to be exchanged
+ * Return: The original qcode value in the queue spinlock
+ */
+static inline u32 queue_code_xchg(struct qspinlock *lock, u32 qcode)
+{
+ union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+ return (u32)xchg(&qlock->qcode, (u16)qcode);
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ qcode <<= _QCODE_OFFSET;
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+
+#define queue_get_lock_qcode queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value
+ * Return : > 0 if lock is not available
+ * = 0 if lock is free
+ * < 0 if lock is taken & can return after cleanup
+ *
+ * It is considered locked when either the lock bit or the wait bit is set.
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+ u32 qlcode;
+
+ qlcode = (u32)atomic_read(&lock->qlcode);
+ /*
+ * In the special case where qlcode contains only _QSPINLOCK_LOCKED
+ * and mycode, try to transition back to the quick spinning code
+ * path by clearing the qcode and setting the _QSPINLOCK_WAITING
+ * bit.
+ */
+ if (qlcode == (_QSPINLOCK_LOCKED | (mycode << _QCODE_OFFSET))) {
+ u32 old = qlcode;
+
+ qlcode = atomic_cmpxchg(&lock->qlcode, old,
+ _QSPINLOCK_LOCKED|_QSPINLOCK_WAITING);
+ if (qlcode == old) {
+ union qspinlock_x86 *slock =
+ (union qspinlock_x86 *)lock;
+try_again:
+ /*
+ * Wait until the lock byte is cleared
+ */
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(slock->lock));
+ /*
+ * Set the lock bit & clear the waiting bit
+ */
+ if (cmpxchg(&slock->lock_wait, _QSPINLOCK_WAITING,
+ _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+ return -1; /* Got the lock */
+ goto try_again;
+ }
+ }
+ *qcode = qlcode >> _QCODE_OFFSET;
+ return qlcode & _QSPINLOCK_LWMASK;
+}
+
+#endif /* _Q_MANY_CPUS */
#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */

#include <asm-generic/qspinlock.h>
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index e8cc9ca..486d8fd 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -26,16 +26,28 @@
/*
* The queue spinlock data structure - a 32-bit word
*
- * The bits assignment are:
+ * For NR_CPUS >= 16K, the bits assignment are:
* Bit 0 : Set if locked
* Bits 1-7 : Not used
* Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bits assignment are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-15: Reserved for architecture specific optimization
+ * Bits 16-31: Queue code
*/
typedef struct qspinlock {
atomic_t qlcode; /* Lock + queue code */
} arch_spinlock_t;

-#define _QCODE_OFFSET 8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _Q_MANY_CPUS
+# define _QCODE_OFFSET 8
+#else
+# define _QCODE_OFFSET 16
+#endif
+
#define _QSPINLOCK_LOCKED 1U
#define _QCODE(lock) (atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
#define _QLOCK(lock) (atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
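
To make the waiting-byte handshake described in the changelog above
concrete, here is a stand-alone C11 sketch of the same idea. It is not
the kernel code: it assumes at most two contenders, uses sequentially
consistent atomics (stronger than the barriers the kernel needs), and
the type and function names are invented for the illustration:

#include <stdatomic.h>
#include <stdint.h>

#define LOCKED	0x0001u			/* lock byte  (bits 0-7)  */
#define WAITING	0x0100u			/* wait byte  (bits 8-15) */

struct tiny_qlock {
	_Atomic uint16_t lock_wait;	/* lock + wait bytes only  */
};

static void tiny_lock(struct tiny_qlock *l)
{
	uint16_t old = 0;

	/* Uncontended fast path: 0 -> LOCKED. */
	if (atomic_compare_exchange_strong(&l->lock_wait, &old, LOCKED))
		return;

	/* Contended: set both bytes and look at what was there before. */
	old = atomic_exchange(&l->lock_wait, WAITING | LOCKED);

	if (old == 0) {
		/* Lock got freed in between; drop the spurious wait bit. */
		atomic_fetch_and(&l->lock_wait, (uint16_t)~WAITING);
		return;
	}
	if (old == WAITING) {
		/* The other task is already waiting; the lock byte we just
		 * set is ours, and its wait bit was preserved by the xchg. */
		return;
	}
	/* old had LOCKED set: wait for the owner, then claim the lock byte. */
	for (;;) {
		while (atomic_load(&l->lock_wait) & LOCKED)
			;			/* cpu_relax() in the kernel */
		old = WAITING;
		if (atomic_compare_exchange_strong(&l->lock_wait, &old,
						   LOCKED))
			return;
	}
}

static void tiny_unlock(struct tiny_qlock *l)
{
	atomic_fetch_and(&l->lock_wait, (uint16_t)~LOCKED);
}

In the actual patch this handshake is only entered from
queue_spin_trylock_quick() when the queue is empty; anything beyond two
contenders falls back to the MCS-style queue in the slowpath.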

2014-02-17 22:48:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/17/2014 12:41 PM, Waiman Long wrote:
> v3->v4:
> - Remove debugging code and fix a configuration error
> - Simplify the qspinlock structure and streamline the code to make it
> perform a bit better
> - Add an x86 version of asm/qspinlock.h for holding x86 specific
> optimization.
> - Add an optimized x86 code path for 2 contending tasks to improve
> low contention performance.
>
> v2->v3:
> - Simplify the code by using numerous mode only without an unfair option.
> - Use the latest smp_load_acquire()/smp_store_release() barriers.
> - Move the queue spinlock code to kernel/locking.
> - Make the use of queue spinlock the default for x86-64 without user
> configuration.
> - Additional performance tuning.
>
> v1->v2:
> - Add some more comments to document what the code does.
> - Add a numerous CPU mode to support >= 16K CPUs
> - Add a configuration option to allow lock stealing which can further
> improve performance in many cases.
> - Enable wakeup of queue head CPU at unlock time for non-numerous
> CPU mode.
>
> This patch set introduces a queue-based spinlock implementation that
> can replace the default ticket spinlock without increasing the size
> of the spinlock data structure. As a result, critical kernel data
> structures that embed spinlock won't increase in size and breaking
> data alignments.
>

This is starting to look good, so I have pulled it into
tip:x86/spinlocks to start give it some testing mileage.

-hpa

2014-02-18 07:31:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
> On 02/17/2014 12:41 PM, Waiman Long wrote:
> > v3->v4:
> > - Remove debugging code and fix a configuration error
> > - Simplify the qspinlock structure and streamline the code to make it
> > perform a bit better
> > - Add an x86 version of asm/qspinlock.h for holding x86 specific
> > optimization.
> > - Add an optimized x86 code path for 2 contending tasks to improve
> > low contention performance.
> >
> > v2->v3:
> > - Simplify the code by using numerous mode only without an unfair option.
> > - Use the latest smp_load_acquire()/smp_store_release() barriers.
> > - Move the queue spinlock code to kernel/locking.
> > - Make the use of queue spinlock the default for x86-64 without user
> > configuration.
> > - Additional performance tuning.
> >
> > v1->v2:
> > - Add some more comments to document what the code does.
> > - Add a numerous CPU mode to support >= 16K CPUs
> > - Add a configuration option to allow lock stealing which can further
> > improve performance in many cases.
> > - Enable wakeup of queue head CPU at unlock time for non-numerous
> > CPU mode.
> >
> > This patch set introduces a queue-based spinlock implementation that
> > can replace the default ticket spinlock without increasing the size
> > of the spinlock data structure. As a result, critical kernel data
> > structures that embed spinlock won't increase in size and breaking
> > data alignments.
> >
>
> This is starting to look good, so I have pulled it into
> tip:x86/spinlocks to start give it some testing mileage.

It very much needs paravirt muck before we can even consider it.

2014-02-18 07:31:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +/*
> + * The queue node structure
> + *
> + * This structure is essentially the same as the mcs_spinlock structure
> + * in mcs_spinlock.h file. This structure is retained for future extension
> + * where new fields may be added.
> + */
> +struct qnode {
> + u32 wait; /* Waiting flag */
> + struct qnode *next; /* Next queue node addr */
> +};
> +
> +struct qnode_set {
> + int node_idx; /* Current node to use */
> + struct qnode nodes[MAX_QNODES];
> +};
> +
> +/*
> + * Per-CPU queue node structures
> + */
> +static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
> + = { 0 };

You really didn't pay attention did you.

That should be DEFINE_PER_CPU_ALIGNED()

Furthermore, your structure is bigger than 1 cacheline; your struct
qnode is 16 bytes, 4*16=64, and then you add that int.

2014-02-18 07:33:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +#define _QCODE(lock) (atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
> +#define _QLOCK(lock) (atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)

> +#define GET_QN_IDX(code) (((code) >> _QCODE_VAL_OFFSET) & 3)
> +#define GET_CPU_NR(code) (((code) >> (_QCODE_VAL_OFFSET + 2)) - 1)
> +#ifndef _SET_QCODE
> +#define _SET_QCODE(cpu, idx) ((((cpu) + 1) << (_QCODE_VAL_OFFSET + 2)) |\
> + ((idx) << _QCODE_VAL_OFFSET) |\


Someone please take the CPP away.. dude most this doesn't need to be a
macro. Its used only in 1 or two places and its utterly unreadable.

2014-02-18 07:38:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
> This is starting to look good, so I have pulled it into
> tip:x86/spinlocks to start give it some testing mileage.

Its still atrociously ugly code please drop it.

2014-02-18 07:40:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
> +{
> + unsigned int cpu_nr, qn_idx;
> + struct qnode *node, *next;
> + u32 prev_qcode, my_qcode;
> +
> +#ifdef queue_spin_trylock_quick
> + /*
> + * Try the quick spinning code path
> + */
> + if (queue_spin_trylock_quick(lock, qsval))
> + return;
> +#endif

why oh why?

> + /*
> + * Get the queue node
> + */
> + cpu_nr = smp_processor_id();
> + node = get_qnode(&qn_idx);
> +
> + if (unlikely(!node)) {
> + /*
> + * This shouldn't happen, print a warning message
> + * & busy spinning on the lock.
> + */
> + printk_sched(
> + "qspinlock: queue node table exhausted at cpu %d!\n",
> + cpu_nr);
> + while (!queue_spin_trylock_unfair(lock))
> + arch_mutex_cpu_relax();
> + return;
> + }
> +
> + /*
> + * Set up the new cpu code to be exchanged
> + */
> + my_qcode = _SET_QCODE(cpu_nr, qn_idx);
> +
> + /*
> + * Initialize the queue node
> + */
> + node->wait = true;
> + node->next = NULL;
> +
> + /*
> + * The lock may be available at this point, try again if no task was
> + * waiting in the queue.
> + */
> + if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
> + put_qnode();
> + return;
> + }
> +
> +#ifdef queue_code_xchg
> + prev_qcode = queue_code_xchg(lock, my_qcode);
> +#else
> + /*
> + * Exchange current copy of the queue node code
> + */
> + prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
> + /*
> + * It is possible that we may accidentally steal the lock. If this is
> + * the case, we need to either release it if not the head of the queue
> + * or get the lock and be done with it.
> + */
> + if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
> + if (prev_qcode == 0) {
> + /*
> + * Got the lock since it is at the head of the queue
> + * Now try to atomically clear the queue code.
> + */
> + if (atomic_cmpxchg(&lock->qlcode, my_qcode,
> + _QSPINLOCK_LOCKED) == my_qcode)
> + goto release_node;
> + /*
> + * The cmpxchg fails only if one or more tasks
> + * are added to the queue. In this case, we need to
> + * notify the next one to be the head of the queue.
> + */
> + goto notify_next;
> + }
> + /*
> + * Accidentally steal the lock, release the lock and
> + * let the queue head get it.
> + */
> + queue_spin_unlock(lock);
> + } else
> + prev_qcode &= ~_QSPINLOCK_LOCKED; /* Clear the lock bit */
> + my_qcode &= ~_QSPINLOCK_LOCKED;
> +#endif /* queue_code_xchg */

WTF is this #ifdef for?

> + if (prev_qcode) {
> + /*
> + * Not at the queue head, get the address of the previous node
> + * and set up the "next" fields of the that node.
> + */
> + struct qnode *prev = xlate_qcode(prev_qcode);
> +
> + ACCESS_ONCE(prev->next) = node;
> + /*
> + * Wait until the waiting flag is off
> + */
> + while (smp_load_acquire(&node->wait))
> + arch_mutex_cpu_relax();
> + }
> +
> + /*
> + * At the head of the wait queue now
> + */
> + while (true) {
> + u32 qcode;
> + int retval;
> +
> + retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
> + if (retval > 0)
> + ; /* Lock not available yet */
> + else if (retval < 0)
> + /* Lock taken, can release the node & return */
> + goto release_node;
> + else if (qcode != my_qcode) {
> + /*
> + * Just get the lock with other spinners waiting
> + * in the queue.
> + */
> + if (queue_spin_trylock_unfair(lock))
> + goto notify_next;

Why is this an option at all?

> + } else {
> + /*
> + * Get the lock & clear the queue code simultaneously
> + */
> + if (queue_spin_trylock_and_clr_qcode(lock, qcode))
> + /* No need to notify the next one */
> + goto release_node;
> + }
> + arch_mutex_cpu_relax();
> + }
> +
> +notify_next:
> + /*
> + * Wait, if needed, until the next one in queue set up the next field
> + */
> + while (!(next = ACCESS_ONCE(node->next)))
> + arch_mutex_cpu_relax();
> + /*
> + * The next one in queue is now at the head
> + */
> + smp_store_release(&next->wait, false);
> +
> +release_node:
> + put_qnode();
> +}
> +EXPORT_SYMBOL(queue_spin_lock_slowpath);
> --
> 1.7.1
>

2014-02-18 07:45:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/17/2014 11:31 PM, Peter Zijlstra wrote:
>>
>> This is starting to look good, so I have pulled it into
>> tip:x86/spinlocks to start give it some testing mileage.
>
> It very much needs paravirt muck before we can even consider it.
>

Ah yes. Joy ... I so love PV.

-hpa

2014-02-18 19:27:59

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/17/2014 05:47 PM, H. Peter Anvin wrote:
> On 02/17/2014 12:41 PM, Waiman Long wrote:
>> v3->v4:
>> - Remove debugging code and fix a configuration error
>> - Simplify the qspinlock structure and streamline the code to make it
>> perform a bit better
>> - Add an x86 version of asm/qspinlock.h for holding x86 specific
>> optimization.
>> - Add an optimized x86 code path for 2 contending tasks to improve
>> low contention performance.
>>
>> v2->v3:
>> - Simplify the code by using numerous mode only without an unfair option.
>> - Use the latest smp_load_acquire()/smp_store_release() barriers.
>> - Move the queue spinlock code to kernel/locking.
>> - Make the use of queue spinlock the default for x86-64 without user
>> configuration.
>> - Additional performance tuning.
>>
>> v1->v2:
>> - Add some more comments to document what the code does.
>> - Add a numerous CPU mode to support>= 16K CPUs
>> - Add a configuration option to allow lock stealing which can further
>> improve performance in many cases.
>> - Enable wakeup of queue head CPU at unlock time for non-numerous
>> CPU mode.
>>
>> This patch set introduces a queue-based spinlock implementation that
>> can replace the default ticket spinlock without increasing the size
>> of the spinlock data structure. As a result, critical kernel data
>> structures that embed spinlock won't increase in size and breaking
>> data alignments.
>>
> This is starting to look good, so I have pulled it into
> tip:x86/spinlocks to start give it some testing mileage.
>
> -hpa
>
>

Thanks for the additional testing. Please let me know if you find
anything wrong.

-Longman

2014-02-18 19:29:51

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/18/2014 02:30 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +/*
>> + * The queue node structure
>> + *
>> + * This structure is essentially the same as the mcs_spinlock structure
>> + * in mcs_spinlock.h file. This structure is retained for future extension
>> + * where new fields may be added.
>> + */
>> +struct qnode {
>> + u32 wait; /* Waiting flag */
>> + struct qnode *next; /* Next queue node addr */
>> +};
>> +
>> +struct qnode_set {
>> + int node_idx; /* Current node to use */
>> + struct qnode nodes[MAX_QNODES];
>> +};
>> +
>> +/*
>> + * Per-CPU queue node structures
>> + */
>> +static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
>> + = { 0 };
> You really didn't pay attention did you.
>
> That should be DEFINE_PER_CPU_ALIGNED()
>
> Furthermore, your structure is bigger than 1 cacheline; your struct
> qnode is 16 bytes, 4*16=64, and then you add that int.

Thanks for letting me know about that. This is a minor problem that I
will fix in the next version.

-Longman

2014-02-18 19:30:45

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/18/2014 02:31 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
>> On 02/17/2014 12:41 PM, Waiman Long wrote:
>>> v3->v4:
>>> - Remove debugging code and fix a configuration error
>>> - Simplify the qspinlock structure and streamline the code to make it
>>> perform a bit better
>>> - Add an x86 version of asm/qspinlock.h for holding x86 specific
>>> optimization.
>>> - Add an optimized x86 code path for 2 contending tasks to improve
>>> low contention performance.
>>>
>>> v2->v3:
>>> - Simplify the code by using numerous mode only without an unfair option.
>>> - Use the latest smp_load_acquire()/smp_store_release() barriers.
>>> - Move the queue spinlock code to kernel/locking.
>>> - Make the use of queue spinlock the default for x86-64 without user
>>> configuration.
>>> - Additional performance tuning.
>>>
>>> v1->v2:
>>> - Add some more comments to document what the code does.
>>> - Add a numerous CPU mode to support>= 16K CPUs
>>> - Add a configuration option to allow lock stealing which can further
>>> improve performance in many cases.
>>> - Enable wakeup of queue head CPU at unlock time for non-numerous
>>> CPU mode.
>>>
>>> This patch set introduces a queue-based spinlock implementation that
>>> can replace the default ticket spinlock without increasing the size
>>> of the spinlock data structure. As a result, critical kernel data
>>> structures that embed spinlock won't increase in size and breaking
>>> data alignments.
>>>
>> This is starting to look good, so I have pulled it into
>> tip:x86/spinlocks to start give it some testing mileage.
> It very much needs paravirt muck before we can even consider it.

I will start looking at how to make it work with paravirt. Hopefully, it
won't take too long.

-Longman

2014-02-18 19:32:36

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/18/2014 02:33 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +#define _QCODE(lock) (atomic_read(&(lock)->qlcode)>> _QCODE_OFFSET)
>> +#define _QLOCK(lock) (atomic_read(&(lock)->qlcode)& _QSPINLOCK_LOCKED)
>> +#define GET_QN_IDX(code) (((code)>> _QCODE_VAL_OFFSET)& 3)
>> +#define GET_CPU_NR(code) (((code)>> (_QCODE_VAL_OFFSET + 2)) - 1)
>> +#ifndef _SET_QCODE
>> +#define _SET_QCODE(cpu, idx) ((((cpu) + 1)<< (_QCODE_VAL_OFFSET + 2)) |\
>> + ((idx)<< _QCODE_VAL_OFFSET) |\
>
> Someone please take the CPP away.. dude most this doesn't need to be a
> macro. Its used only in 1 or two places and its utterly unreadable.

Yes, I can take them away. It is not a big problem.

-Longman

2014-02-18 19:40:04

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/18/2014 02:39 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>> +{
>> + unsigned int cpu_nr, qn_idx;
>> + struct qnode *node, *next;
>> + u32 prev_qcode, my_qcode;
>> +
>> +#ifdef queue_spin_trylock_quick
>> + /*
>> + * Try the quick spinning code path
>> + */
>> + if (queue_spin_trylock_quick(lock, qsval))
>> + return;
>> +#endif
> why oh why?

I could take this #ifdef away. I just need to add a default version that
always returns 0.

>> + /*
>> + * Get the queue node
>> + */
>> + cpu_nr = smp_processor_id();
>> + node = get_qnode(&qn_idx);
>> +
>> + if (unlikely(!node)) {
>> + /*
>> + * This shouldn't happen, print a warning message
>> + *& busy spinning on the lock.
>> + */
>> + printk_sched(
>> + "qspinlock: queue node table exhausted at cpu %d!\n",
>> + cpu_nr);
>> + while (!queue_spin_trylock_unfair(lock))
>> + arch_mutex_cpu_relax();
>> + return;
>> + }
>> +
>> + /*
>> + * Set up the new cpu code to be exchanged
>> + */
>> + my_qcode = _SET_QCODE(cpu_nr, qn_idx);
>> +
>> + /*
>> + * Initialize the queue node
>> + */
>> + node->wait = true;
>> + node->next = NULL;
>> +
>> + /*
>> + * The lock may be available at this point, try again if no task was
>> + * waiting in the queue.
>> + */
>> + if (!(qsval>> _QCODE_OFFSET)&& queue_spin_trylock(lock)) {
>> + put_qnode();
>> + return;
>> + }
>> +
>> +#ifdef queue_code_xchg
>> + prev_qcode = queue_code_xchg(lock, my_qcode);
>> +#else
>> + /*
>> + * Exchange current copy of the queue node code
>> + */
>> + prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
>> + /*
>> + * It is possible that we may accidentally steal the lock. If this is
>> + * the case, we need to either release it if not the head of the queue
>> + * or get the lock and be done with it.
>> + */
>> + if (unlikely(!(prev_qcode& _QSPINLOCK_LOCKED))) {
>> + if (prev_qcode == 0) {
>> + /*
>> + * Got the lock since it is at the head of the queue
>> + * Now try to atomically clear the queue code.
>> + */
>> + if (atomic_cmpxchg(&lock->qlcode, my_qcode,
>> + _QSPINLOCK_LOCKED) == my_qcode)
>> + goto release_node;
>> + /*
>> + * The cmpxchg fails only if one or more tasks
>> + * are added to the queue. In this case, we need to
>> + * notify the next one to be the head of the queue.
>> + */
>> + goto notify_next;
>> + }
>> + /*
>> + * Accidentally steal the lock, release the lock and
>> + * let the queue head get it.
>> + */
>> + queue_spin_unlock(lock);
>> + } else
>> + prev_qcode&= ~_QSPINLOCK_LOCKED; /* Clear the lock bit */
>> + my_qcode&= ~_QSPINLOCK_LOCKED;
>> +#endif /* queue_code_xchg */
> WTF is this #ifdef for?

The #ifdef is harder to take away here. The point is that doing a 32-bit
exchange may accidentally steal the lock with the additional code to
handle that. Doing a 16-bit exchange, on the other hand, will never
steal the lock and so don't need the extra handling code. I could
construct a function with different return values to handle the
different cases if you think it will make the code easier to read.


>> + if (prev_qcode) {
>> + /*
>> + * Not at the queue head, get the address of the previous node
>> + * and set up the "next" fields of the that node.
>> + */
>> + struct qnode *prev = xlate_qcode(prev_qcode);
>> +
>> + ACCESS_ONCE(prev->next) = node;
>> + /*
>> + * Wait until the waiting flag is off
>> + */
>> + while (smp_load_acquire(&node->wait))
>> + arch_mutex_cpu_relax();
>> + }
>> +
>> + /*
>> + * At the head of the wait queue now
>> + */
>> + while (true) {
>> + u32 qcode;
>> + int retval;
>> +
>> + retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>> + if (retval> 0)
>> + ; /* Lock not available yet */
>> + else if (retval< 0)
>> + /* Lock taken, can release the node& return */
>> + goto release_node;
>> + else if (qcode != my_qcode) {
>> + /*
>> + * Just get the lock with other spinners waiting
>> + * in the queue.
>> + */
>> + if (queue_spin_trylock_unfair(lock))
>> + goto notify_next;
> Why is this an option at all?
>
>

Are you referring to the case (qcode != my_qcode)? This condition will
be true if more than one tasks have queued up.

-Longman

2014-02-18 21:28:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> I will start looking at how to make it work with paravirt. Hopefully, it
> won't take too long.

The cheap way out is to simply switch to the test-and-set spinlock on
whatever X86_FEATURE_ indicates a guest I suppose.
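
For reference, the kind of test-and-set fallback being suggested is
roughly the following (a user-space C11 sketch, not the kernel's
arch_spin_lock; the names are made up for the illustration):

#include <stdatomic.h>
#include <stdbool.h>

struct tas_lock {
	atomic_bool locked;
};

static void tas_lock_acquire(struct tas_lock *l)
{
	for (;;) {
		/* Read-only spin first so failed atomic operations do not
		 * keep bouncing the cache line (test-and-test-and-set). */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			;
		if (!atomic_exchange_explicit(&l->locked, true,
					      memory_order_acquire))
			return;
	}
}

static void tas_lock_release(struct tas_lock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}

Such a lock is unfair, but it imposes no queueing order for an
overcommitted hypervisor to invert, which is what makes it attractive
as a cheap paravirt fallback.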

2014-02-18 21:34:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> The #ifdef is harder to take away here. The point is that doing a 32-bit
> exchange may accidentally steal the lock with the additional code to handle
> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
> and so don't need the extra handling code. I could construct a function with
> different return values to handle the different cases if you think it will
> make the code easier to read.

Does it really pay to use xchg() with all those fixup cases? Why not
have a single cmpxchg() loop that does just the exact atomic op you
want?

2014-02-18 21:38:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>+ /*
> >>+ * At the head of the wait queue now
> >>+ */
> >>+ while (true) {
> >>+ u32 qcode;
> >>+ int retval;
> >>+
> >>+ retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
> >>+ if (retval> 0)
> >>+ ; /* Lock not available yet */
> >>+ else if (retval< 0)
> >>+ /* Lock taken, can release the node& return */
> >>+ goto release_node;
> >>+ else if (qcode != my_qcode) {
> >>+ /*
> >>+ * Just get the lock with other spinners waiting
> >>+ * in the queue.
> >>+ */
> >>+ if (queue_spin_trylock_unfair(lock))
> >>+ goto notify_next;
> >Why is this an option at all?
> >
> >
>
> Are you referring to the case (qcode != my_qcode)? This condition will be
> true if more than one tasks have queued up.

But in no case should we revert to unfair spinning or stealing. We
should always respect the queueing order.

If the lock tail no longer points to us, then there's further waiters
and we should wait for ->next and unlock it -- after we've taken the
lock.

2014-02-19 00:42:40

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>> I will start looking at how to make it work with paravirt. Hopefully, it
>> won't take too long.
> The cheap way out is to simply switch to the test-and-set spinlock on
> whatever X86_FEATURE_ indicates a guest I suppose.

I don't think there is X86_FEATURE flag that indicates running in a
guest. In fact, a guest should never find out if it is running virtualized.

After reading the current PV ticketlock implementation, I have a rough
idea of what I need to do to implement PV support in qspinlock. A large
portion of PV ticketlock code is find out the CPU number of the next one
to get the lock. The current qspinlock implementation has already
included CPU number of the previous member in the queue and it should be
pretty easy to store CPU number of the next one in the queue node
structure. These CPU numbers can then be supplied to the kick_cpu()
function to schedule in the required CPU to make sure that progress
can be made.

I will try to implement this idea to see how thing work out.

-Longman

2014-02-19 00:50:51

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>> The #ifdef is harder to take away here. The point is that doing a 32-bit
>> exchange may accidentally steal the lock with the additional code to handle
>> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
>> and so don't need the extra handling code. I could construct a function with
>> different return values to handle the different cases if you think it will
>> make the code easier to read.
> Does it really pay to use xchg() with all those fixup cases? Why not
> have a single cmpxchg() loop that does just the exact atomic op you
> want?

The main reason for using xchg instead of cmpxchg is its performance
impact when the lock is heavily contended. Under those circumstances, a
task may need to do several tries of read+atomic-RMV before getting it
right. This may cause a lot of cacheline contention. With xchg, we need
at most 2 atomic ops. Using cmpxchg() does simplify the code a bit at
the expense of performance with heavy contention.

-Longman
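
To make the tradeoff being discussed concrete, here is a stand-alone
C11 sketch (not the kernel code; the helper names are invented) of the
two ways a waiter can publish its qcode into a 32-bit word that also
carries the lock bit:

#include <stdatomic.h>
#include <stdint.h>

#define LOCKED_BIT	1u

/*
 * xchg flavour: always exactly one atomic RMW, but it overwrites the
 * whole word, so the caller may momentarily "steal" the lock bit and
 * has to fix that up afterwards (the unlikely branch in the slowpath).
 */
static uint32_t publish_tail_xchg(_Atomic uint32_t *word, uint32_t qcode)
{
	return atomic_exchange(word, qcode);
}

/*
 * cmpxchg flavour: never disturbs the lock bit, but under heavy
 * contention the compare can keep failing because other CPUs keep
 * changing the word, costing repeated read + atomic-RMW rounds.
 */
static uint32_t publish_tail_cmpxchg(_Atomic uint32_t *word, uint32_t qcode)
{
	uint32_t old = atomic_load(word);

	while (!atomic_compare_exchange_weak(word, &old,
					     (old & LOCKED_BIT) | qcode))
		;	/* 'old' is refreshed on failure; try again */
	return old;
}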

2014-02-19 00:59:12

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>> + /*
>>>> + * At the head of the wait queue now
>>>> + */
>>>> + while (true) {
>>>> + u32 qcode;
>>>> + int retval;
>>>> +
>>>> + retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>>>> + if (retval> 0)
>>>> + ; /* Lock not available yet */
>>>> + else if (retval< 0)
>>>> + /* Lock taken, can release the node& return */
>>>> + goto release_node;
>>>> + else if (qcode != my_qcode) {
>>>> + /*
>>>> + * Just get the lock with other spinners waiting
>>>> + * in the queue.
>>>> + */
>>>> + if (queue_spin_trylock_unfair(lock))
>>>> + goto notify_next;
>>> Why is this an option at all?
>>>
>>>
>> Are you referring to the case (qcode != my_qcode)? This condition will be
>> true if more than one tasks have queued up.
> But in no case should we revert to unfair spinning or stealing. We
> should always respect the queueing order.
>
> If the lock tail no longer points to us, then there's further waiters
> and we should wait for ->next and unlock it -- after we've taken the
> lock.
>

A task will be in this loop when it is already the head of a queue and
is entitled to take the lock. The condition (qcode != my_qcode) is to
decide whether it should just take the lock or take the lock & clear the
code simultaneously. I am a bit cautious to use
queue_spin_trylock_unfair() as there is a possibility that a CPU may run
out of the queue node and need to do unfair busy spinning.

-Longman

2014-02-19 07:03:40

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/19/2014 06:12 AM, Waiman Long wrote:
> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>> I will start looking at how to make it work with paravirt. Hopefully, it
>>> won't take too long.
>> The cheap way out is to simply switch to the test-and-set spinlock on
>> whatever X86_FEATURE_ indicates a guest I suppose.
>
> I don't think there is X86_FEATURE flag that indicates running in a
> guest. In fact, a guest should never find out if it is running virtualized.
>
> After reading the current PV ticketlock implementation, I have a rough
> idea of what I need to do to implement PV support in qspinlock. A large
> portion of PV ticketlock code is find out the CPU number of the next one
> to get the lock. The current qspinlock implementation has already
> included CPU number of the previous member in the queue and it should be
> pretty easy to store CPU number of the next one in the queue node
> structure. These CPU numbers can then be supplied to the kick_cpu()
> function to schedule in the required CPU to make sure that progress
> can be made.

That is correct.
Strict serialization of the lock is usually a headache for virtualized
guests (especially when overcommitted). I am eager to test the next
version.

2014-02-19 08:51:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> >>I will start looking at how to make it work with paravirt. Hopefully, it
> >>won't take too long.
> >The cheap way out is to simply switch to the test-and-set spinlock on
> >whatever X86_FEATURE_ indicates a guest I suppose.
>
> I don't think there is X86_FEATURE flag that indicates running in a guest.
> In fact, a guest should never find out if it is running virtualized.

No it very much should; how else is paravirt ever going to work?

2014-02-19 08:52:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Tue, Feb 18, 2014 at 07:50:13PM -0500, Waiman Long wrote:
> On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>The #ifdef is harder to take away here. The point is that doing a 32-bit
> >>exchange may accidentally steal the lock with the additional code to handle
> >>that. Doing a 16-bit exchange, on the other hand, will never steal the lock
> >>and so don't need the extra handling code. I could construct a function with
> >>different return values to handle the different cases if you think it will
> >>make the code easier to read.
> >Does it really pay to use xchg() with all those fixup cases? Why not
> >have a single cmpxchg() loop that does just the exact atomic op you
> >want?
>
> The main reason for using xchg instead of cmpxchg is its performance impact
> when the lock is heavily contended. Under those circumstances, a task may
> need to do several tries of read+atomic-RMV before getting it right. This
> may cause a lot of cacheline contention. With xchg, we need at most 2 atomic
> ops. Using cmpxchg() does simplify the code a bit at the expense of
> performance with heavy contention.

Have you actually measured this?

2014-02-19 08:55:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On Tue, Feb 18, 2014 at 07:58:49PM -0500, Waiman Long wrote:
> On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>>>+ /*
> >>>>+ * At the head of the wait queue now
> >>>>+ */
> >>>>+ while (true) {
> >>>>+ u32 qcode;
> >>>>+ int retval;
> >>>>+
> >>>>+ retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
> >>>>+ if (retval> 0)
> >>>>+ ; /* Lock not available yet */
> >>>>+ else if (retval< 0)
> >>>>+ /* Lock taken, can release the node& return */
> >>>>+ goto release_node;
> >>>>+ else if (qcode != my_qcode) {
> >>>>+ /*
> >>>>+ * Just get the lock with other spinners waiting
> >>>>+ * in the queue.
> >>>>+ */
> >>>>+ if (queue_spin_trylock_unfair(lock))
> >>>>+ goto notify_next;
> >>>Why is this an option at all?
> >>>
> >>>
> >>Are you referring to the case (qcode != my_qcode)? This condition will be
> >>true if more than one tasks have queued up.
> >But in no case should we revert to unfair spinning or stealing. We
> >should always respect the queueing order.
> >
> >If the lock tail no longer points to us, then there's further waiters
> >and we should wait for ->next and unlock it -- after we've taken the
> >lock.
> >
>
> A task will be in this loop when it is already the head of a queue and is
> entitled to take the lock. The condition (qcode != my_qcode) is to decide
> whether it should just take the lock or take the lock & clear the code
> simultaneously. I am a bit cautious to use queue_spin_trylock_unfair() as
> there is a possibility that a CPU may run out of the queue node and need to
> do unfair busy spinning.

No; there is no such possibility. Add BUG_ON(idx>=4) and make sure of
it.

There's simply no more than 4 contexts that can nest at any one time:

task context
softirq context
hardirq context
nmi context

And someone contending a spinlock from NMI context should be shot
anyway.

Getting more nested spinlocks is an absolute hard fail.
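
Concretely, the per-CPU node pool this argument relies on could look
roughly like the following sketch (qnode, qnode_idx and the helper names
are assumptions; the patch's actual allocation scheme may differ):

/*
 * One queue node per CPU per nesting context.  Four entries cover task,
 * softirq, hardirq and NMI context, so running out of nodes is treated
 * as a bug rather than a case to handle with unfair spinning.
 */
#define MAX_QNODES      4       /* task, softirq, hardirq, nmi */

struct qnode {
        struct qnode    *next;
        int             wait;
};

static DEFINE_PER_CPU(struct qnode, qnodes[MAX_QNODES]);
static DEFINE_PER_CPU(int, qnode_idx);

static struct qnode *get_qnode(void)
{
        int idx = this_cpu_inc_return(qnode_idx) - 1;

        BUG_ON(idx >= MAX_QNODES);      /* deeper nesting is a hard fail */
        return this_cpu_ptr(&qnodes[idx]);
}

static void put_qnode(void)
{
        this_cpu_dec(qnode_idx);
}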

2014-02-19 19:25:14

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>>> I will start looking at how to make it work with paravirt. Hopefully, it
>>>> won't take too long.
>>> The cheap way out is to simply switch to the test-and-set spinlock on
>>> whatever X86_FEATURE_ indicates a guest I suppose.
>> I don't think there is X86_FEATURE flag that indicates running in a guest.
>> In fact, a guest should never find out if it is running virtualized.
> No it very much should; how else is paravirt ever going to work?

We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
queue spinlock can be easily changed into an unfair lock which allows
lock stealing. We could have a config option to make it unfair in the
PARAVIRT environment, but I don't think Linus likes the idea of an
unfair lock.

-Longman

2014-02-19 19:26:28

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/19/2014 03:52 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:50:13PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>> The #ifdef is harder to take away here. The point is that doing a 32-bit
>>>> exchange may accidentally steal the lock with the additional code to handle
>>>> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
>>>> and so don't need the extra handling code. I could construct a function with
>>>> different return values to handle the different cases if you think it will
>>>> make the code easier to read.
>>> Does it really pay to use xchg() with all those fixup cases? Why not
>>> have a single cmpxchg() loop that does just the exact atomic op you
>>> want?
>> The main reason for using xchg instead of cmpxchg is its performance impact
>> when the lock is heavily contended. Under those circumstances, a task may
>> need to do several tries of read+atomic-RMV before getting it right. This
>> may cause a lot of cacheline contention. With xchg, we need at most 2 atomic
>> ops. Using cmpxchg() does simplify the code a bit at the expense of
>> performance with heavy contention.
> Have you actually measured this?

I haven't actually measured that myself. It is mostly from my
experience. I could do some timing experiments with the cmpxchg() change
and report back to you later.

-Longman
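
For reference, the trade-off being discussed looks roughly like this
(a simplified sketch operating on an assumed u32 tail field, not the
code from the patch):

/*
 * xchg() style: one unconditional swap.  The waiter always succeeds on
 * its first atomic operation and gets the previous tail back, at the
 * cost of fix-up code for anything the swap may have clobbered.
 */
static u32 set_tail_xchg(u32 *tail, u32 my_qcode)
{
        return xchg(tail, my_qcode);            /* returns the previous tail */
}

/*
 * cmpxchg() style: retry loop.  Only the exact transition we want is
 * performed, which simplifies the surrounding code, but under heavy
 * contention each failed attempt re-reads and re-dirties the cacheline.
 */
static u32 set_tail_cmpxchg(u32 *tail, u32 my_qcode)
{
        u32 old;

        for (;;) {
                old = ACCESS_ONCE(*tail);
                if (cmpxchg(tail, old, my_qcode) == old)
                        return old;             /* we are now the queue tail */
        }
}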

2014-02-19 19:30:55

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation

On 02/19/2014 03:55 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:58:49PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>>>> + /*
>>>>>> + * At the head of the wait queue now
>>>>>> + */
>>>>>> + while (true) {
>>>>>> + u32 qcode;
>>>>>> + int retval;
>>>>>> +
>>>>>> + retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>>>>>> + if (retval> 0)
>>>>>> + ; /* Lock not available yet */
>>>>>> + else if (retval< 0)
>>>>>> + /* Lock taken, can release the node& return */
>>>>>> + goto release_node;
>>>>>> + else if (qcode != my_qcode) {
>>>>>> + /*
>>>>>> + * Just get the lock with other spinners waiting
>>>>>> + * in the queue.
>>>>>> + */
>>>>>> + if (queue_spin_trylock_unfair(lock))
>>>>>> + goto notify_next;
>>>>> Why is this an option at all?
>>>>>
>>>>>
>>>> Are you referring to the case (qcode != my_qcode)? This condition will be
>>>> true if more than one tasks have queued up.
>>> But in no case should we revert to unfair spinning or stealing. We
>>> should always respect the queueing order.
>>>
>>> If the lock tail no longer points to us, then there's further waiters
>>> and we should wait for ->next and unlock it -- after we've taken the
>>> lock.
>>>
>> A task will be in this loop when it is already the head of a queue and is
>> entitled to take the lock. The condition (qcode != my_qcode) is to decide
>> whether it should just take the lock or take the lock& clear the code
>> simultaneously. I am a bit cautious to use queue_spin_trylock_unfair() as
>> there is a possibility that a CPU may run out of the queue node and need to
>> do unfair busy spinning.
> No; there is no such possibility. Add BUG_ON(idx>=4) and make sure of
> it.

Yes, I could do that.

However, in the generic implementation, I still need some kind of atomic
cmpxchg to set the lock bit. On x86, I could probably just do a simple
assignment of 1 to the lock byte.

> There's simply no more than 4 contexts that can nest at any one time:
>
> task context
> softirq context
> hardirq context
> nmi context
>
> And someone contending a spinlock from NMI context should be shot
> anyway.
>
> Getting more nested spinlocks is an absolute hard fail.
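
A sketch of the two lock-bit paths mentioned above, assuming a
little-endian lock word whose low byte is the lock byte (the union and
function names are made up for the example):

union qlock {
        u32 val;
        struct {
                u8  locked;     /* lock byte (byte 0, little-endian) */
                u8  reserved;
                u16 qcode;      /* queue tail code */
        };
};

/* Generic: other fields share the word, so use an atomic RMW. */
static void queue_set_locked_generic(union qlock *lock)
{
        u32 old, new;

        do {
                old = ACCESS_ONCE(lock->val);
                new = old | 1;                  /* bit 0 is the lock bit */
        } while (cmpxchg(&lock->val, old, new) != old);
}

/* x86-style: a one-byte store is atomic, so a plain assignment is enough. */
static void queue_set_locked_x86(union qlock *lock)
{
        ACCESS_ONCE(lock->locked) = 1;
}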

2014-02-19 19:49:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Wed, Feb 19, 2014 at 02:24:49PM -0500, Waiman Long wrote:
> On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
> >>On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
> >>>On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> >>>>I will start looking at how to make it work with paravirt. Hopefully, it
> >>>>won't take too long.
> >>>The cheap way out is to simply switch to the test-and-set spinlock on
> >>>whatever X86_FEATURE_ indicates a guest I suppose.
> >>I don't think there is X86_FEATURE flag that indicates running in a guest.
> >>In fact, a guest should never find out if it is running virtualized.
> >No it very much should; how else is paravirt ever going to work?
>
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows lock
> stealing. We could have a config option to make it unfair in the PARAVIRT
> environment, but I don't think Linus like the idea of an unfair lock.

No; a guest is very much aware of paravirt. See for example the
static_key_false(&paravirt_ticketlocks_enabled). It would be impossible
to set that branch if you never knew you were a guest.

2014-02-19 20:17:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Wed, Feb 19, 2014 at 11:24 AM, Waiman Long <[email protected]> wrote:
>
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows lock
> stealing. We could have a config option to make it unfair in the PARAVIRT
> environment, but I don't think Linus like the idea of an unfair lock.

I could care less for the paravirt case. As long as the native case
ends up being sane (even when CONFIG_PARAVIRT is set at compile time),
I'm fine.

When actually running in a paravirtualized environment, locks are
always going to have problems.

Linus

2014-02-19 21:30:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/19/2014 11:24 AM, Waiman Long wrote:
> On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
>>> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>>>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>>>> I will start looking at how to make it work with paravirt.
>>>>> Hopefully, it
>>>>> won't take too long.
>>>> The cheap way out is to simply switch to the test-and-set spinlock on
>>>> whatever X86_FEATURE_ indicates a guest I suppose.
>>> I don't think there is X86_FEATURE flag that indicates running in a
>>> guest.
>>> In fact, a guest should never find out if it is running virtualized.
>> No it very much should; how else is paravirt ever going to work?
>
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows
> lock stealing. We could have a config option to make it unfair in the
> PARAVIRT environment, but I don't think Linus like the idea of an unfair
> lock.
>

The case where we run native on a system with CONFIG_PARAVIRT enabled
DOES matter.

-hpa

2014-02-20 17:37:26

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/19/2014 02:48 PM, Peter Zijlstra wrote:
> On Wed, Feb 19, 2014 at 02:24:49PM -0500, Waiman Long wrote:
>> On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
>>>> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>>>>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>>>>> I will start looking at how to make it work with paravirt. Hopefully, it
>>>>>> won't take too long.
>>>>> The cheap way out is to simply switch to the test-and-set spinlock on
>>>>> whatever X86_FEATURE_ indicates a guest I suppose.
>>>> I don't think there is X86_FEATURE flag that indicates running in a guest.
>>>> In fact, a guest should never find out if it is running virtualized.
>>> No it very much should; how else is paravirt ever going to work?
>> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
>> queue spinlock can be easily changed into an unfair lock which allows lock
>> stealing. We could have a config option to make it unfair in the PARAVIRT
>> environment, but I don't think Linus like the idea of an unfair lock.
> No; a guest is very much aware of paravirt. See for example the
> static_key_false(&paravirt_ticketlocks_enabled). It would be impossible
> to set that branch if you never knew you were a guest.

Yes, that is true for paravirt, but I was talking about virtualization
in general.

-Longman

2014-02-20 17:44:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Thu, Feb 20, 2014 at 9:37 AM, Waiman Long <[email protected]> wrote:
>
> Yes, that is true for paravirt, but I was talking about virtualization in
> general.

For non-paravirtualized environments, we should basically optimize for
the actual hardware, and assume that the full virtualization is good
enough that we don't care.

Obviously, sometimes we might take "this is particularly expensive to
virtualize" into account, and if it doesn't hurt hardware, try to help
the virtual environment along. So you might have cases where certain
hardware (typically interrupt controllers etc) should be accessed
certain particular ways, because those not only work on real hardware,
but behave well in virtualized environments too.

But for spinlocks, just make the raw hardware work well when not
*actively* paravirtualized. If some hypervisor deals badly with it,
it's a hypervisor or management problem.

Linus

2014-02-20 17:54:18

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/19/2014 03:17 PM, Linus Torvalds wrote:
> On Wed, Feb 19, 2014 at 11:24 AM, Waiman Long<[email protected]> wrote:
>> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
>> queue spinlock can be easily changed into an unfair lock which allows lock
>> stealing. We could have a config option to make it unfair in the PARAVIRT
>> environment, but I don't think Linus like the idea of an unfair lock.
> I could care less for the paravirt case. As long as the native case
> ends up being sane (even when CONFIG_PARAVIRT is set at compile time),
> I'm fine.
>
> When actually running in a paravirtualized environment, locks are
> always going to have problems.
>
> Linus

I think we could implement 2 versions of _raw_spin_lock. The primary one
is fair and the other one is unfair, with a jump label that jumps from
the fair version to the unfair version. At boot time, the kernel can
check to see if it is really running in a PV environment and then
activate the jump label to use the unfair version.

Now the key is how to detect if a kernel is really running in a PV
environment. I need to ask some virtualization experts on that.

-Longman

2014-02-20 18:16:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Thu, Feb 20, 2014 at 12:37:14PM -0500, Waiman Long wrote:
> Yes, that is true for paravirt, but I was talking about virtualization in
> general.

Them we really don't care about :-)

2014-02-20 18:42:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Thu, Feb 20, 2014 at 9:54 AM, Waiman Long <[email protected]> wrote:
>
> I think we could implement 2 versions of _raw_spin_lock.

Yup. Or rather, I'd suggest implement just one version of
arch_spin_lock(), but at the top of it you do something like

#if CONFIG_PARAVIRT_SPINLOCK
        if (static_key_false(&unfair_spinlocks)) {
                .. do paravirt unfair lock version ..
        }
#endif

which should basically generate almost-perfect code: it's one extra
no-op for the native case if CONFIG_PARAVIRT_SPINLOCK is on, which
turns into a branch for the unfair version for paravirtualization.

Or something like that.

Linus

2014-02-20 19:21:15

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/20/2014 01:42 PM, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 9:54 AM, Waiman Long<[email protected]> wrote:
>> I think we could implement 2 versions of _raw_spin_lock.
> Yup. Or rather, I'd suggest implement just one version of
> arch_spin_lock(), but at the top of it you do something like
>
> #if CONFIG_PARAVIRT_SPINLOCK
>         if (static_key_false(&unfair_spinlocks)) {
>                 .. do paravirt unfair lock version ..
>         }
> #endif
>
> which should basically generate almost-perfect code: it's one extra
> no-op for the native case if CONFIG_PARAVIRT_SPINLOCK is on, which
> turns into a branch for the unfair version for paravirtualization.
>
> Or something like that.
>
> Linus

Yes, this is actually what I meant. The only difference is that I am
thinking about using a different config variable, as PARAVIRT_SPINLOCKS
actually means something else.

-Longman
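
Spelled out with a separately named key and config option, the pattern
sketched by Linus could look like this (CONFIG_VIRT_UNFAIR_LOCKS,
virt_unfair_spinlocks and queue_spin_lock_unfair() are placeholder
names, not existing symbols):

#ifdef CONFIG_VIRT_UNFAIR_LOCKS
extern struct static_key virt_unfair_spinlocks;         /* default off */
#endif

static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
#ifdef CONFIG_VIRT_UNFAIR_LOCKS
        if (static_key_false(&virt_unfair_spinlocks)) {
                queue_spin_lock_unfair(lock);   /* simple test-and-set style */
                return;
        }
#endif
        queue_spin_lock(lock);                  /* normal fair, queued path */
}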

2014-02-20 19:26:17

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/20/2014 11:24 PM, Waiman Long wrote:

> Now the key is how to detect if a kernel is really running in a PV
> environment. I need to ask some virtualization experts on that.
>

For kvm you could just follow,
kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
part __do_cpuid_ent().

2014-02-21 17:03:20

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/20/2014 02:32 PM, Raghavendra K T wrote:
> On 02/20/2014 11:24 PM, Waiman Long wrote:
>
>> Now the key is how to detect if a kernel is really running in a PV
>> environment. I need to ask some virtualization experts on that.
>>
>
> For kvm you could just follow,
> kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
> part __do_cpuid_ent().
>

I saw that in the arch/x86/kernel/kvm.c file. So I think I can use the
return value of kvm_para_available() to decide if the kernel is running
in a KVM guest. However, I am not so sure what to use for Xen.

-Longman

2014-02-21 17:05:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On Fri, Feb 21, 2014 at 9:02 AM, Waiman Long <[email protected]> wrote:
> On 02/20/2014 02:32 PM, Raghavendra K T wrote:
>>
>> On 02/20/2014 11:24 PM, Waiman Long wrote:
>>
>>> Now the key is how to detect if a kernel is really running in a PV
>>> environment. I need to ask some virtualization experts on that.
>>>
>>
>> For kvm you could just follow,
>> kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
>> part __do_cpuid_ent().
>>
>
> I saw that in the arch/x86/kernel/kvm.c file. So I think I can use the
> returned value of kvm_para_available() to decide if it is running in a KVM
> guest. However, I am not so sure of what to use in xen.

Just use that 'paravirt_ticketlocks_enabled' static key. Xen enables it too.

Linus
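
A sketch of the boot-time switch this thread is converging on (the
virt_unfair_spinlocks key and the initcall name are placeholders; only
kvm_para_available() and paravirt_ticketlocks_enabled are existing
symbols):

#include <asm/kvm_para.h>

/* Placeholder key controlling the unfair slowpath, default off. */
struct static_key virt_unfair_spinlocks = STATIC_KEY_INIT_FALSE;

static int __init virt_spinlock_init(void)
{
        /*
         * Either detect the guest directly (KVM shown here; Xen would
         * need its own check) or simply piggy-back on the existing
         * paravirt_ticketlocks_enabled key, which both KVM and Xen
         * already enable.
         */
        if (kvm_para_available())
                static_key_slow_inc(&virt_unfair_spinlocks);
        return 0;
}
early_initcall(virt_spinlock_init);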

2014-02-21 17:08:28

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
>
> Why is this x86 only code?

The code is making use of the fact that a byte write is atomic, which is
true on x86 and probably on a few other architectures. I could pull this
code into the generic qspinlock.c file and set a flag in the asm header
file to activate it, if that is what you want.

-Longman

2014-02-21 17:10:29

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/21/2014 12:08 PM, Waiman Long wrote:
> On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
>>
>> Why is this x86 only code?
>
> The code is making use of the fact that byte write is atomic which is
> true in x86 and probably in a few other architectures. I could pull
> these codes into the generic qspinlock.c file and set a flag in the
> asm header file to activate it if it is what you want.
>
> -Longman

BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.

-Longman

2014-02-21 17:27:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Fri, Feb 21, 2014 at 12:09:57PM -0500, Waiman Long wrote:
> On 02/21/2014 12:08 PM, Waiman Long wrote:
> >On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
> >>
> >>Why is this x86 only code?
> >
> >The code is making use of the fact that byte write is atomic which is true
> >in x86 and probably in a few other architectures. I could pull these codes
> >into the generic qspinlock.c file and set a flag in the asm header file to
> >activate it if it is what you want.
> >
> >-Longman
>
> BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.

Right, screw Alpha :-) Just pull it into the generic code; it's far too
much code to replicate per arch.

2014-02-21 17:28:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Mon, Feb 17, 2014 at 03:41:24PM -0500, Waiman Long wrote:
> + struct {
> + u8 lock; /* Lock bit */
> + u8 wait; /* Waiting bit */
> + u16 qcode; /* Queue code */
> + };

A 16-bit qcode would result in 14 bits for the CPU number; that's only
16k, and I think SGI actually had a machine with that many CPUs in it.

2014-02-22 01:36:38

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/21/2014 12:26 PM, Peter Zijlstra wrote:
> On Fri, Feb 21, 2014 at 12:09:57PM -0500, Waiman Long wrote:
>> On 02/21/2014 12:08 PM, Waiman Long wrote:
>>> On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
>>>> Why is this x86 only code?
>>> The code is making use of the fact that byte write is atomic which is true
>>> in x86 and probably in a few other architectures. I could pull these codes
>>> into the generic qspinlock.c file and set a flag in the asm header file to
>>> activate it if it is what you want.
>>>
>>> -Longman
>> BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.
> Right, screw Alpha :-) Just pull it into the generic code; its far too
> much code to replicate per arch.

OK, I will do that in the next version.

-Longman

2014-02-22 01:39:48

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/21/2014 12:28 PM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:24PM -0500, Waiman Long wrote:
>> + struct {
>> + u8 lock; /* Lock bit */
>> + u8 wait; /* Waiting bit */
>> + u16 qcode; /* Queue code */
>> + };
> 16 bit code would result in 14 bits for the CPU number, that's only 16k,
> I think SGI actually had a machine with that many CPUs in.
>

If NR_CPUS >= 16k, the 2-task optimized code path will be disabled.
However, with that many CPUs, it is the high-contention spinlock
behavior that matters more than a slightly better performance at low
contention levels.

-Longman
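
To make the arithmetic explicit: two of the sixteen qcode bits are
needed to select one of the four per-CPU queue nodes (the four nesting
contexts discussed earlier), leaving fourteen bits for the CPU number,
i.e. at most 16k CPUs. A sketch of such an encoding and the guard that
disables the fast path (the macro names and the +1 offset are
assumptions, not the patch's code):

#define _QCODE_IDX_BITS         2                               /* 4 nesting levels */
#define _QCODE_CPU_BITS         (16 - _QCODE_IDX_BITS)          /* 14 */
#define _QCODE_CPU_MAX          (1 << _QCODE_CPU_BITS)          /* 16384 */

/* CPU numbers are offset by 1 so that qcode 0 can mean "no waiter". */
static inline u16 encode_qcode(int cpu, int idx)
{
        return ((cpu + 1) << _QCODE_IDX_BITS) | idx;
}

/* With more possible CPUs than the qcode can hold, skip the fast path. */
#if CONFIG_NR_CPUS >= _QCODE_CPU_MAX
#define DISABLE_2TASK_OPT       1
#endif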

2014-02-22 16:56:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock


* Peter Zijlstra <[email protected]> wrote:

> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
> > This is starting to look good, so I have pulled it into
> > tip:x86/spinlocks to start give it some testing mileage.
>
> Its still atrociously ugly code please drop it.

Fair enough - I've excluded it from tip:master for now, until the
uglies get resolved.

Waiman: please address PeterZ's feedback.

Thanks,

Ingo

2014-02-25 03:37:33

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock

On 02/22/2014 11:56 AM, Ingo Molnar wrote:
> * Peter Zijlstra<[email protected]> wrote:
>
>> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
>>> This is starting to look good, so I have pulled it into
>>> tip:x86/spinlocks to start give it some testing mileage.
>> Its still atrociously ugly code please drop it.
> Fair enough - I've excluded it from tip:master for now, until the
> uglies get resolved.
>
> Waiman: please address PeterZ's feedback.
>
> Thanks,
>
> Ingo

I am trying to address PeterZ's feedback while adding
para-virtualization support right now. I will have a new version ready
sometime this week.

-Longman