v14->v15:
- Incorporate PeterZ's v15 qspinlock patch and improve upon the PV
qspinlock code by dynamically allocating the hash table as well
as some other performance optimization.
- Simplified the Xen PV qspinlock code as suggested by David Vrabel
<[email protected]>.
- Add benchmarking data for 3.19 kernel to compare the performance
of a spinlock heavy test with and without the qspinlock patch
under different cpufreq drivers and scaling governors.
v13->v14:
- Patches 1 & 2: Add queue_spin_unlock_wait() to accommodate commit
78bff1c86 from Oleg Nesterov.
- Fix the system hang problem when using PV qspinlock in an
over-committed guest due to a racing condition in the
pv_set_head_in_tail() function.
- Increase the MAYHALT_THRESHOLD from 10 to 1024.
- Change kick_cpu into a regular function pointer instead of a
callee-saved function.
- Change lock statistics code to use separate bits for different
statistics.
v12->v13:
- Change patch 9 to generate separate versions of the
queue_spin_lock_slowpath functions for bare metal and PV guest. This
reduces the performance impact of the PV code on bare metal systems.
v11->v12:
- Based on PeterZ's version of the qspinlock patch
(https://lkml.org/lkml/2014/6/15/63).
- Incorporated many of the review comments from Konrad Wilk and
Paolo Bonzini.
- The pvqspinlock code is largely from my previous version with
PeterZ's way of going from queue tail to head and his idea of
using callee saved calls to KVM and XEN codes.
v10->v11:
- Use a simple test-and-set unfair lock to simplify the code,
but performance may suffer a bit for large guest with many CPUs.
- Take out Raghavendra KT's test results as the unfair lock changes
may render some of his results invalid.
- Add PV support without increasing the size of the core queue node
structure.
- Other minor changes to address some of the feedback comments.
v9->v10:
- Make some minor changes to qspinlock.c to accommodate review feedback.
- Change author to PeterZ for 2 of the patches.
- Include Raghavendra KT's test results in patch 18.
v8->v9:
- Integrate PeterZ's version of the queue spinlock patch with some
modification:
http://lkml.kernel.org/r/[email protected]
- Break the more complex patches into smaller ones to ease review effort.
- Fix a racing condition in the PV qspinlock code.
v7->v8:
- Remove one unneeded atomic operation from the slowpath, thus
improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
v6->v7:
- Remove an atomic operation from the 2-task contending code
- Shorten the names of some macros
- Make the queue waiter to attempt to steal lock when unfair lock is
enabled.
- Remove lock holder kick from the PV code and fix a race condition
- Run the unfair lock & PV code on overcommitted KVM guests to collect
performance data.
v5->v6:
- Change the optimized 2-task contending code to make it fairer at the
expense of a bit of performance.
- Add a patch to support unfair queue spinlock for Xen.
- Modify the PV qspinlock code to follow what was done in the PV
ticketlock.
- Add performance data for the unfair lock as well as the PV
support code.
v4->v5:
- Move the optimized 2-task contending code to the generic file to
enable more architectures to use it without code duplication.
- Address some of the style-related comments by PeterZ.
- Allow the use of unfair queue spinlock in a real para-virtualized
execution environment.
- Add para-virtualization support to the qspinlock code by ensuring
that the lock holder and queue head stay alive as much as possible.
v3->v4:
- Remove debugging code and fix a configuration error
- Simplify the qspinlock structure and streamline the code to make it
perform a bit better
- Add an x86 version of asm/qspinlock.h for holding x86 specific
optimization.
- Add an optimized x86 code path for 2 contending tasks to improve
low contention performance.
v2->v3:
- Simplify the code by using numerous mode only without an unfair option.
- Use the latest smp_load_acquire()/smp_store_release() barriers.
- Move the queue spinlock code to kernel/locking.
- Make the use of queue spinlock the default for x86-64 without user
configuration.
- Additional performance tuning.
v1->v2:
- Add some more comments to document what the code does.
- Add a numerous CPU mode to support >= 16K CPUs
- Add a configuration option to allow lock stealing which can further
improve performance in many cases.
- Enable wakeup of queue head CPU at unlock time for non-numerous
CPU mode.
This patch set has 3 different sections:
1) Patches 1-6: Introduces a queue-based spinlock implementation that
can replace the default ticket spinlock without increasing the
size of the spinlock data structure. As a result, critical kernel
data structures that embed spinlock won't increase in size and
break data alignments.
2) Patch 7: Enables the use of unfair lock in a virtual guest. This
can resolve some of the locking related performance issues due to
halting the waiting CPUs after spinning for a certain amount of
time. The unlock code will detect the a sleeping waiter and wake it
up. This is essentially the same logic as the PV ticketlock code.
3) Patches 8-15: Add paravirtualization spinlock support. This
is needed as paravirt spinlock was enabled by default on
distribution like RHEL 7.
The queue spinlock has slightly better performance than the ticket
spinlock in uncontended case. Its performance can be much better
with moderate to heavy contention. This patch has the potential of
however, will not be able to support TSX.
The queue spinlock is especially suitable for NUMA machines with
at least 2 sockets. Though even at the 2-socket level, there can be
significant speedup depending on the workload.
The purpose of this patch set is not to solve any particular spinlock
contention problems. Those need to be solved by refactoring the code
to make more efficient use of the lock or finer granularity ones. The
main purpose is to make the lock contention problems more tolerable
until someone can spend the time and effort to fix them.
The PV portion of the patch series is geared toward performance
optimization for bare metal. However, the qspinlock performance in
virtual guest should still be comparable with ticket spinlock at low
load and much better at high load.
Performance of kernel with qspinlock patch
------------------------------------------
In term of the performance benefit of this patch, I ran the
high_systime workload (which does a lot of fork() and exit())
at various load levels (500, 1000, 1500 and 2000 users) on a
4-socket IvyBridge-EX bare-metal system (60 cores, 120 threads)
with intel_pstate driver and performance scaling governor. The JPM
(jobs/minutes) and execution time results were as follows:
Kernel JPM Execution Time
------ --- --------------
At 500 users:
3.19 118857.14 26.25s
3.19-qspinlock 134889.75 23.13s
% change +13.5% -11.9%
At 1000 users:
3.19 204255.32 30.55s
3.19-qspinlock 239631.34 26.04s
% change +17.3% -14.8%
At 1500 users:
3.19 177272.73 52.80s
3.19-qspinlock 326132.40 28.70s
% change +84.0% -45.6%
At 2000 users:
3.19 196690.31 63.45s
3.19-qspinlock 341730.56 36.52s
% change +73.7% -42.4%
It turns out that this workload was causing quite a lot of spinlock
contention in the vanilla 3.19 kernel. The performance advantage of
this patch increases with heavier loads.
With the powersave governor, the JPM data were as follows:
Users 3.19 3.19-qspinlock % Change
----- ---- -------------- --------
500 112635.38 132596.69 +17.7%
1000 171240.40 240369.80 +40.4%
1500 130507.53 324436.74 +148.6%
2000 175972.93 341637.01 +94.1%
With the qspinlock patch, there wasn't too much difference in
performance between the 2 scaling governors. Without this patch,
the powersave governor was much slower than the performance governor.
By disabling the intel_pstate driver and use acpi_cpufreq instead,
the benchmark performance (JPM) at 1000 users level for the performance
and ondemand governors were:
Governor 3.19 3.19-qspinlock % Change
-------- ---- -------------- --------
performance 124949.94 219950.65 +76.0%
ondemand 4838.90 206690.96 +4171%
The performance was just horrible when there was significant spinlock
contention with the ondemand governor. There was also significant
run-to-run variation. A second run of the same benchmark gave a result
of 22115 JPMs. With the qspinlock patch, however, the performance was
much more stable under different cpufreq drivers and governors. That
is not the case with the default ticket spinlock implementation.
The %CPU times spent on spinlock contention (from perf) with the
performance governor and the intel_pstate driver were:
Kernel Function 3.19 kernel 3.19-qspinlock kernel
--------------- ----------- ---------------------
At 500 users:
_raw_spin_lock* 28.23% 2.25%
queue_spin_lock_slowpath N/A 4.05%
At 1000 users:
_raw_spin_lock* 23.21% 2.25%
queue_spin_lock_slowpath N/A 4.42%
At 1500 users:
_raw_spin_lock* 29.07% 2.24%
queue_spin_lock_slowpath N/A 4.49%
At 2000 users:
_raw_spin_lock* 29.15% 2.26%
queue_spin_lock_slowpath N/A 4.82%
The top spinlock related entries in the perf profile for the 3.19
kernel at 1000 users were:
7.40% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--58.96%-- rwsem_wake
|--20.02%-- release_pages
|--15.88%-- pagevec_lru_move_fn
|--1.53%-- get_page_from_freelist
|--0.78%-- __wake_up
|--0.55%-- try_to_wake_up
--2.28%-- [...]
3.13% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--37.55%-- free_one_page
|--17.47%-- __cache_free_alien
|--4.95%-- __rcu_process_callbacks
|--2.93%-- __pte_alloc
|--2.68%-- __drain_alien_cache
|--2.56%-- ext4_do_update_inode
|--2.54%-- try_to_wake_up
|--2.46%-- pgd_free
|--2.32%-- cache_alloc_refill
|--2.32%-- pgd_alloc
|--2.32%-- free_pcppages_bulk
|--1.88%-- do_wp_page
|--1.77%-- handle_pte_fault
|--1.58%-- do_anonymous_page
|--1.56%-- rmqueue_bulk.clone.0
|--1.35%-- copy_pte_range
|--1.25%-- zap_pte_range
|--1.13%-- cache_flusharray
|--0.88%-- __pmd_alloc
|--0.70%-- wake_up_new_task
|--0.66%-- __pud_alloc
|--0.59%-- ext4_discard_preallocations
--6.53%-- [...]
With the qspinlock patch, the perf profile at 1000 users was:
3.25% reaim [kernel.kallsyms] [k] queue_spin_lock_slowpath
|--62.00%-- _raw_spin_lock_irqsave
| |--77.49%-- rwsem_wake
| |--11.99%-- release_pages
| |--4.34%-- pagevec_lru_move_fn
| |--1.93%-- get_page_from_freelist
| |--1.90%-- prepare_to_wait_exclusive
| |--1.29%-- __wake_up
| |--0.74%-- finish_wait
|--11.63%-- _raw_spin_lock
| |--31.11%-- try_to_wake_up
| |--18.48%-- free_one_page
| |--13.97%-- __cache_free_alien
| |--7.77%-- free_pcppages_bulk
| |--7.12%-- __drain_alien_cache
| |--6.17%-- rmqueue_bulk.clone.0
| |--4.17%-- __rcu_process_callbacks
| |--2.22%-- cache_alloc_refill
| |--2.15%-- wake_up_new_task
| |--1.80%-- ext4_do_update_inode
| |--1.52%-- cache_flusharray
| |--0.89%-- __mutex_unlock_slowpath
| |--0.64%-- ttwu_queue
|--11.19%-- _raw_spin_lock_irq
| |--98.95%-- rwsem_down_write_failed
| |--0.93%-- __schedule
|--7.91%-- queue_read_lock_slowpath
| _raw_read_lock
| |--96.79%-- do_wait
| |--2.44%-- do_prlimit
| chrdev_open
| do_dentry_open
| vfs_open
| do_last
| path_openat
| do_filp_open
| do_sys_open
| sys_open
| system_call
| __GI___libc_open
|--7.05%-- queue_write_lock_slowpath
| _raw_write_lock_irq
| |--35.36%-- release_task
| |--32.76%-- copy_process
| do_exit
| do_group_exit
| sys_exit_group
| system_call
--0.22%-- [...]
This demonstrates the benefit of this patch for those applications
that run on multi-socket machines and can cause significant spinlock
contentions in the kernel.
I also got report that the queue spinlock patch can improve the
performance of an I/O and interrupt intensive stress test with a lot
of spinlock contention on a 2-socket system by up to 20%.
Daniel Blueman of NumaScale had tested the v14 qspinlock patch on an
older 512-core NumaConnect system for testing with 3.19.1:
$ src/reaim -c data/reaim.config -f data/workfile.compute -i 16 -e 256
Num Parent Child Child Jobs per Jobs/min/ Std_dev Std_dev JTI
Forked Time SysTime UTime Minute Child Time Percent
1 2.15 0.36 1.19 2818.60 2818.60 0.00 0.00 100
17 6.29 42.41 22.43 16378.38 963.43 0.27 4.43 95
33 56.24 1611.77 47.04 3555.83 107.75 1.58 2.88 97
49 120.88 5576.87 63.54 2456.49 50.13 1.75 1.47 98
65 199.17 12328.42 81.27 1977.71 30.43 2.42 1.23 98
81 270.89 20943.99 103.99 1812.03 22.37 3.85 1.44 98
97 315.31 29211.63 121.30 1864.26 19.22 4.48 1.44 98
113 397.68 42766.15 144.48 1721.94 15.24 7.04 1.80 98
129 520.74 63442.27 162.03 1501.21 11.64 10.24 1.99 98
...
Patched with qspinlocks v14:
Num Parent Child Child Jobs per Jobs/min/ Std_dev Std_dev JTI
Forked Time SysTime UTime Minute Child Time Percent
1 1.98 0.36 1.21 3060.61 3060.61 0.00 0.00 100
17 5.60 25.41 22.39 18396.43 1082.14 0.16 3.00 97
33 12.67 153.64 49.17 15783.74 478.30 0.50 4.16 95
49 21.15 365.55 76.61 14039.72 286.52 0.83 4.19 95
65 27.57 556.87 105.96 14287.27 219.80 0.75 2.80 97
81 33.07 712.92 127.85 14843.06 183.25 1.16 3.69 96
97 42.81 1009.02 161.87 13730.90 141.56 1.75 4.27 95
113 49.13 1166.04 185.10 13938.12 123.35 2.11 4.53 95
129 56.62 1262.07 231.92 13806.78 107.03 2.67 4.99 95
145 66.41 1420.29 250.24 13231.44 91.25 2.95 4.70 95
179 90.15 2237.78 325.18 12032.61 67.22 4.23 4.94 95
193 85.41 1865.81 326.64 13693.71 70.95 4.29 5.32 94
220 93.58 2298.77 376.46 14246.63 64.76 4.97 5.65 94
Max Jobs per Minute 18396.43
The compute workload is mostly userspace code with some synchronization
between working threads (HPC type workload). With large user count, the
patched kernel has close to 8X the performance of an unpatched kernel.
Peter Zijlstra (Intel) (4):
qspinlock: Add pending bit
qspinlock: Optimize for smaller NR_CPUS
qspinlock: Revert to test-and-set on hypervisors
pvqspinlock: Implement the paravirt qspinlock for x86
Waiman Long (11):
qspinlock: A simple generic 4-byte queue spinlock
qspinlock, x86: Enable x86-64 to use queue spinlock
qspinlock: Extract out code snippets for the next patch
qspinlock: Use a simple write to grab the lock
lfsr: a simple binary Galois linear feedback shift register
pvqspinlock: Implement simple paravirt support for the qspinlock
pvqspinlock, x86: Enable PV qspinlock for KVM
pvqspinlock, x86: Enable PV qspinlock for Xen
pvqspinlock: Only kick CPU at unlock time
pvqspinlock: Improve slowpath performance by avoiding cmpxchg
pvqspinlock: Add debug code to check for PV lock hash sanity
arch/x86/Kconfig | 3 +-
arch/x86/include/asm/paravirt.h | 28 ++-
arch/x86/include/asm/paravirt_types.h | 10 +
arch/x86/include/asm/qspinlock.h | 57 ++++
arch/x86/include/asm/spinlock.h | 5 +
arch/x86/include/asm/spinlock_types.h | 4 +
arch/x86/kernel/kvm.c | 43 +++
arch/x86/kernel/paravirt-spinlocks.c | 24 ++-
arch/x86/kernel/paravirt_patch_32.c | 22 ++-
arch/x86/kernel/paravirt_patch_64.c | 22 ++-
arch/x86/xen/spinlock.c | 63 ++++-
include/asm-generic/qspinlock.h | 139 ++++++++++
include/asm-generic/qspinlock_types.h | 79 ++++++
include/linux/lfsr.h | 80 ++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/mcs_spinlock.h | 1 +
kernel/locking/qspinlock.c | 474 +++++++++++++++++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 433 ++++++++++++++++++++++++++++++
19 files changed, 1480 insertions(+), 15 deletions(-)
create mode 100644 arch/x86/include/asm/qspinlock.h
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 include/linux/lfsr.h
create mode 100644 kernel/locking/qspinlock.c
create mode 100644 kernel/locking/qspinlock_paravirt.h
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.
Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.
This queue spinlock is especially suit to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.
Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities. By allocating a set of per-cpu queue nodes and used them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.
Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/asm-generic/qspinlock.h | 132 +++++++++++++++++++++
include/asm-generic/qspinlock_types.h | 58 +++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/mcs_spinlock.h | 1 +
kernel/locking/qspinlock.c | 209 +++++++++++++++++++++++++++++++++
6 files changed, 408 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..315d6dc
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,132 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ * locked wrt the lockref code to avoid lock stealing by the lockref
+ * code and change things underneath the lock. This also allows some
+ * optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->val) &&
+ (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+ return 1;
+ return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ u32 val;
+
+ val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+ if (likely(val == 0))
+ return;
+ queue_spin_lock_slowpath(lock, val);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * smp_mb__before_atomic() in order to guarantee release semantics
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_Q_LOCKED_VAL, &lock->val);
+}
+#endif
+
+/**
+ * queue_spin_unlock_wait - wait until current lock holder releases the lock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * There is a very slight possibility of live-lock if the lockers keep coming
+ * and the waiter is just unfortunate enough to not see any unlock state.
+ */
+static inline void queue_spin_unlock_wait(struct qspinlock *lock)
+{
+ while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
+ cpu_relax();
+}
+
+/*
+ * Initializier
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+#define arch_spin_unlock_wait(l) queue_spin_unlock_wait(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..c9348d8
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,58 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file incluson via paravirt_types.h. So don't include
+ * it if PARAVIRT is on.
+ */
+#ifndef CONFIG_PARAVIRT
+#include <linux/types.h>
+#include <linux/atomic.h>
+#endif
+
+typedef struct qspinlock {
+ atomic_t val;
+} arch_spinlock_t;
+
+/*
+ * Bitfields in the atomic value:
+ *
+ * 0- 7: locked byte
+ * 8- 9: tail index
+ * 10-31: tail cpu (+1)
+ */
+#define _Q_SET_MASK(type) (((1U << _Q_ ## type ## _BITS) - 1)\
+ << _Q_ ## type ## _OFFSET)
+#define _Q_LOCKED_OFFSET 0
+#define _Q_LOCKED_BITS 8
+#define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_TAIL_IDX_BITS 2
+#define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX)
+
+#define _Q_TAIL_CPU_OFFSET (_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
+#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
+
+#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 08561f1..c6a8f7c 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -235,6 +235,13 @@ config LOCK_SPIN_ON_OWNER
def_bool y
depends on MUTEX_SPIN_ON_OWNER || RWSEM_SPIN_ON_OWNER
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT_SPINLOCKS
+
config ARCH_USE_QUEUE_RWLOCK
bool
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index de7a416..66a896f 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
obj-$(CONFIG_SMP) += lglock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index d1fe2ba..51370d2 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,6 +17,7 @@
struct mcs_spinlock {
struct mcs_spinlock *next;
int locked; /* 1 if lock acquired */
+ int count; /* nesting count, see qspinlock.c */
};
#ifndef arch_mcs_spin_lock_contended
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..3456819
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,209 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2013-2014 Red Hat, Inc.
+ * (C) Copyright 2015 Intel Corp.
+ *
+ * Authors: Waiman Long <[email protected]>
+ * Peter Zijlstra <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <asm/qspinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock, however to make
+ * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
+ * API, we must modify it somehow.
+ *
+ * In particular; where the traditional MCS lock consists of a tail pointer
+ * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
+ * unlock the next pending (next->locked), we compress both these: {tail,
+ * next->locked} into a single u32 value.
+ *
+ * Since a spinlock disables recursion of its own context and there is a limit
+ * to the contexts that can nest; namely: task, softirq, hardirq, nmi. As there
+ * are at most 4 nesting levels, it can be encoded by a 2-bit number. Now
+ * we can encode the tail by combining the 2-bit nesting level with the cpu
+ * number. With one byte for the lock value and 3 bytes for the tail, only a
+ * 32-bit word is now needed. Even though we only need 1 bit for the lock,
+ * we extend it to a full byte to achieve better performance for architectures
+ * that support atomic byte write.
+ *
+ * We also change the first spinner to spin on the lock bit instead of its
+ * node; whereby avoiding the need to carry a node from lock to unlock, and
+ * preserving existing lock API. This also makes the unlock code simpler and
+ * faster.
+ */
+
+#include "mcs_spinlock.h"
+
+/*
+ * Per-CPU queue node structures; we can never have more than 4 nested
+ * contexts: task, softirq, hardirq, nmi.
+ *
+ * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+
+/*
+ * We must be able to distinguish between no-tail and the tail at 0:0,
+ * therefore increment the cpu number by one.
+ */
+
+static inline u32 encode_tail(int cpu, int idx)
+{
+ u32 tail;
+
+#ifdef CONFIG_DEBUG_SPINLOCK
+ BUG_ON(idx > 3);
+#endif
+ tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+ tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
+
+ return tail;
+}
+
+static inline struct mcs_spinlock *decode_tail(u32 tail)
+{
+ int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+ int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+ return per_cpu_ptr(&mcs_nodes[idx], cpu);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ * @val: Current value of the queue spinlock 32-bit word
+ *
+ * (queue tail, lock value)
+ *
+ * fast : slow : unlock
+ * : :
+ * uncontended (0,0) --:--> (0,1) --------------------------------:--> (*,0)
+ * : | ^--------. / :
+ * : v \ | :
+ * uncontended : (n,x) --+--> (n,0) | :
+ * queue : | ^--' | :
+ * : v | :
+ * contended : (*,x) --+--> (*,0) -----> (*,1) ---' :
+ * queue : ^--' :
+ *
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ struct mcs_spinlock *prev, *next, *node;
+ u32 new, old, tail;
+ int idx;
+
+ BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+ node = this_cpu_ptr(&mcs_nodes[0]);
+ idx = node->count++;
+ tail = encode_tail(smp_processor_id(), idx);
+
+ node += idx;
+ node->locked = 0;
+ node->next = NULL;
+
+ /*
+ * trylock || xchg(lock, node)
+ *
+ * 0,0 -> 0,1 ; no tail, not locked -> no tail, locked.
+ * p,x -> n,x ; tail was p -> tail is n; preserving locked.
+ */
+ for (;;) {
+ new = _Q_LOCKED_VAL;
+ if (val)
+ new = tail | (val & _Q_LOCKED_MASK);
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * we won the trylock; forget about queueing.
+ */
+ if (new == _Q_LOCKED_VAL)
+ goto release;
+
+ /*
+ * if there was a previous node; link it and wait until reaching the
+ * head of the waitqueue.
+ */
+ if (old & ~_Q_LOCKED_MASK) {
+ prev = decode_tail(old);
+ WRITE_ONCE(prev->next, node);
+
+ arch_mcs_spin_lock_contended(&node->locked);
+ }
+
+ /*
+ * we're at the head of the waitqueue, wait for the owner to go away.
+ *
+ * *,x -> *,0
+ */
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ cpu_relax();
+
+ /*
+ * claim the lock:
+ *
+ * n,0 -> 0,1 : lock, uncontended
+ * *,0 -> *,1 : lock, contended
+ */
+ for (;;) {
+ new = _Q_LOCKED_VAL;
+ if (val != tail)
+ new |= val;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * contended path; wait for next, release.
+ */
+ if (new != _Q_LOCKED_VAL) {
+ while (!(next = READ_ONCE(node->next)))
+ cpu_relax();
+
+ arch_mcs_spin_unlock_contended(&next->locked);
+ }
+
+release:
+ /*
+ * release the node
+ */
+ this_cpu_dec(mcs_nodes[0].count);
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
--
1.7.1
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket. The benefit of queue
spinlock may not be apparent. So queue spinlock is not enabled.
Currently, there is some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.
The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/qspinlock.h | 20 ++++++++++++++++++++
arch/x86/include/asm/spinlock.h | 5 +++++
arch/x86/include/asm/spinlock_types.h | 4 ++++
4 files changed, 30 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/qspinlock.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..49fecb1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+ select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..222995b
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ smp_store_release((u8 *)lock, 0);
+}
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index cf87de3..a9c01fd 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS
static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
cpu_relax();
}
}
+#endif /* CONFIG_QUEUE_SPINLOCK */
/*
* Read-write spinlocks, allowing multiple readers
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
#define TICKET_SHIFT (sizeof(__ticket_t) * 8)
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
} arch_spinlock_t;
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
#include <asm-generic/qrwlock_types.h>
--
1.7.1
From: Peter Zijlstra (Intel) <[email protected]>
Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.
It is possible so observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.
In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
include/asm-generic/qspinlock_types.h | 12 +++-
kernel/locking/qspinlock.c | 119 +++++++++++++++++++++++++++------
2 files changed, 107 insertions(+), 24 deletions(-)
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index c9348d8..9c3f5c2 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
* Bitfields in the atomic value:
*
* 0- 7: locked byte
- * 8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ * 9-10: tail index
+ * 11-31: tail cpu (+1)
*/
#define _Q_SET_MASK(type) (((1U << _Q_ ## type ## _BITS) - 1)\
<< _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
#define _Q_LOCKED_BITS 8
#define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS 1
+#define _Q_PENDING_MASK _Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
#define _Q_TAIL_IDX_BITS 2
#define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX)
@@ -54,5 +59,6 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3456819..0351f78 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
return per_cpu_ptr(&mcs_nodes[idx], cpu);
}
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
/**
* queue_spin_lock_slowpath - acquire the queue spinlock
* @lock: Pointer to queue spinlock structure
* @val: Current value of the queue spinlock 32-bit word
*
- * (queue tail, lock value)
- *
- * fast : slow : unlock
- * : :
- * uncontended (0,0) --:--> (0,1) --------------------------------:--> (*,0)
- * : | ^--------. / :
- * : v \ | :
- * uncontended : (n,x) --+--> (n,0) | :
- * queue : | ^--' | :
- * : v | :
- * contended : (*,x) --+--> (*,0) -----> (*,1) ---' :
- * queue : ^--' :
+ * (queue tail, pending bit, lock value)
*
+ * fast : slow : unlock
+ * : :
+ * uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ * : | ^--------.------. / :
+ * : v \ \ | :
+ * pending : (0,1,1) +--> (0,1,0) \ | :
+ * : | ^--' | | :
+ * : v | | :
+ * uncontended : (n,x,y) +--> (n,0,0) --' | :
+ * queue : | ^--' | :
+ * : v | :
+ * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
+ * queue : ^--' :
*/
void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
@@ -121,6 +125,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+ /*
+ * wait for in-progress pending->locked hand-overs
+ *
+ * 0,1,0 -> 0,0,1
+ */
+ if (val == _Q_PENDING_VAL) {
+ while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
+ cpu_relax();
+ }
+
+ /*
+ * trylock || pending
+ *
+ * 0,0,0 -> 0,0,1 ; trylock
+ * 0,0,1 -> 0,1,1 ; pending
+ */
+ for (;;) {
+ /*
+ * If we observe any contention; queue.
+ */
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * we won the trylock
+ */
+ if (new == _Q_LOCKED_VAL)
+ return;
+
+ /*
+ * we're pending, wait for the owner to go away.
+ *
+ * *,1,1 -> *,1,0
+ */
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ cpu_relax();
+
+ /*
+ * take ownership and clear the pending bit.
+ *
+ * *,1,0 -> *,0,1
+ */
+ for (;;) {
+ new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+ return;
+
+ /*
+ * End of pending bit optimistic spinning and beginning of MCS
+ * queuing.
+ */
+queue:
node = this_cpu_ptr(&mcs_nodes[0]);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -130,15 +203,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
node->next = NULL;
/*
+ * We have already touched the queueing cacheline; don't bother with
+ * pending stuff.
+ *
* trylock || xchg(lock, node)
*
- * 0,0 -> 0,1 ; no tail, not locked -> no tail, locked.
- * p,x -> n,x ; tail was p -> tail is n; preserving locked.
+ * 0,0,0 -> 0,0,1 ; no tail, not locked -> no tail, locked.
+ * p,y,x -> n,y,x ; tail was p -> tail is n; preserving locked.
*/
for (;;) {
new = _Q_LOCKED_VAL;
if (val)
- new = tail | (val & _Q_LOCKED_MASK);
+ new = tail | (val & _Q_LOCKED_PENDING_MASK);
old = atomic_cmpxchg(&lock->val, val, new);
if (old == val)
@@ -157,7 +233,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
- if (old & ~_Q_LOCKED_MASK) {
+ if (old & ~_Q_LOCKED_PENDING_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
@@ -165,18 +241,19 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
}
/*
- * we're at the head of the waitqueue, wait for the owner to go away.
+ * we're at the head of the waitqueue, wait for the owner & pending to
+ * go away.
*
- * *,x -> *,0
+ * *,x,y -> *,0,0
*/
- while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
/*
* claim the lock:
*
- * n,0 -> 0,1 : lock, uncontended
- * *,0 -> *,1 : lock, contended
+ * n,0,0 -> 0,0,1 : lock, uncontended
+ * *,0,0 -> *,0,1 : lock, contended
*/
for (;;) {
new = _Q_LOCKED_VAL;
--
1.7.1
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.
1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.
This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/asm-generic/qspinlock_types.h | 2 +
kernel/locking/qspinlock.c | 79 ++++++++++++++++++++-------------
2 files changed, 50 insertions(+), 31 deletions(-)
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 9c3f5c2..ef36613 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
+#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0351f78..11f6ad9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -97,6 +97,42 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ u32 old, new, val = atomic_read(&lock->val);
+
+ for (;;) {
+ new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+ return old;
+}
+
+/**
* queue_spin_lock_slowpath - acquire the queue spinlock
* @lock: Pointer to queue spinlock structure
* @val: Current value of the queue spinlock 32-bit word
@@ -178,15 +214,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
*
* *,1,0 -> *,0,1
*/
- for (;;) {
- new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
-
- val = old;
- }
+ clear_pending_set_locked(lock);
return;
/*
@@ -203,37 +231,26 @@ queue:
node->next = NULL;
/*
- * We have already touched the queueing cacheline; don't bother with
- * pending stuff.
- *
- * trylock || xchg(lock, node)
- *
- * 0,0,0 -> 0,0,1 ; no tail, not locked -> no tail, locked.
- * p,y,x -> n,y,x ; tail was p -> tail is n; preserving locked.
+ * We touched a (possibly) cold cacheline in the per-cpu queue node;
+ * attempt the trylock once more in the hope someone let go while we
+ * weren't watching.
*/
- for (;;) {
- new = _Q_LOCKED_VAL;
- if (val)
- new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
-
- val = old;
- }
+ if (queue_spin_trylock(lock))
+ goto release;
/*
- * we won the trylock; forget about queueing.
+ * We have already touched the queueing cacheline; don't bother with
+ * pending stuff.
+ *
+ * p,*,* -> n,*,*
*/
- if (new == _Q_LOCKED_VAL)
- goto release;
+ old = xchg_tail(lock, tail);
/*
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
- if (old & ~_Q_LOCKED_PENDING_MASK) {
+ if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
--
1.7.1
From: Peter Zijlstra (Intel) <[email protected]>
When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.
By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated compxchg() operations.
This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.
This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.
All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
include/asm-generic/qspinlock_types.h | 13 ++++++
kernel/locking/qspinlock.c | 69 ++++++++++++++++++++++++++++++++-
2 files changed, 81 insertions(+), 1 deletions(-)
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index ef36613..f01b55d 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
/*
* Bitfields in the atomic value:
*
+ * When NR_CPUS < 16K
+ * 0- 7: locked byte
+ * 8: pending
+ * 9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
* 0- 7: locked byte
* 8: pending
* 9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
#define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS 8
+#else
#define _Q_PENDING_BITS 1
+#endif
#define _Q_PENDING_MASK _Q_SET_MASK(PENDING)
#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 11f6ad9..bcc99e6 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -24,6 +24,7 @@
#include <linux/percpu.h>
#include <linux/hardirq.h>
#include <linux/mutex.h>
+#include <asm/byteorder.h>
#include <asm/qspinlock.h>
/*
@@ -56,6 +57,10 @@
* node; whereby avoiding the need to carry a node from lock to unlock, and
* preserving existing lock API. This also makes the unlock code simpler and
* faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ * atomic operations on smaller 8-bit and 16-bit data types.
+ *
*/
#include "mcs_spinlock.h"
@@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+ union {
+ atomic_t val;
+ struct {
+#ifdef __LITTLE_ENDIAN
+ u16 locked_pending;
+ u16 tail;
+#else
+ u16 tail;
+ u16 locked_pending;
+#endif
+ };
+ };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
/**
* clear_pending_set_locked - take ownership and clear the pending bit.
* @lock: Pointer to queue spinlock structure
@@ -131,6 +192,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
}
return old;
}
+#endif /* _Q_PENDING_BITS == 8 */
/**
* queue_spin_lock_slowpath - acquire the queue spinlock
@@ -205,8 +267,13 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* we're pending, wait for the owner to go away.
*
* *,1,1 -> *,1,0
+ *
+ * this wait loop must be a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because not all clear_pending_set_locked()
+ * implementations imply full barriers.
*/
- while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
cpu_relax();
/*
--
1.7.1
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head don't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.
With that change, the are some slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westere-EX machine as shown in the tables below.
[Standalone/Embedded - same node]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 2324/2321 2248/2265 -3%/-2%
4 2890/2896 2819/2831 -2%/-2%
5 3611/3595 3522/3512 -2%/-2%
6 4281/4276 4173/4160 -3%/-3%
7 5018/5001 4875/4861 -3%/-3%
8 5759/5750 5563/5568 -3%/-3%
[Standalone/Embedded - different nodes]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 12242/12237 12087/12093 -1%/-1%
4 10688/10696 10507/10521 -2%/-2%
It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and was essentially to close
the performance gap between the ticket spinlock and queue spinlock.
The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:
AIM7 XFS Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 5678233 3.17 96.61 5.81
qspinlock 5750799 3.13 94.83 5.97
AIM7 EXT4 Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 1114551 16.15 509.72 7.11
qspinlock 2184466 8.24 232.99 6.01
The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.
The "ebizzy -m" test was also run with the following results:
kernel records/s Real Time Sys Time Usr Time
----- --------- --------- -------- --------
ticketlock 2075 10.00 216.35 3.49
qspinlock 3023 10.00 198.20 4.80
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/locking/qspinlock.c | 66 +++++++++++++++++++++++++++++++++----------
1 files changed, 50 insertions(+), 16 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index bcc99e6..99503ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,37 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
* By using the whole 2nd least significant byte for the pending bit, we
* can allow better optimization of the lock acquisition for the pending
* bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
*/
-#if _Q_PENDING_BITS == 8
-
struct __qspinlock {
union {
atomic_t val;
- struct {
#ifdef __LITTLE_ENDIAN
+ struct {
+ u8 locked;
+ u8 pending;
+ };
+ struct {
u16 locked_pending;
u16 tail;
+ };
#else
+ struct {
u16 tail;
u16 locked_pending;
-#endif
};
+ struct {
+ u8 reserved[2];
+ u8 pending;
+ u8 locked;
+ };
+#endif
};
};
+#if _Q_PENDING_BITS == 8
/**
* clear_pending_set_locked - take ownership and clear the pending bit.
* @lock: Pointer to queue spinlock structure
@@ -195,6 +208,19 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
#endif /* _Q_PENDING_BITS == 8 */
/**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+}
+
+/**
* queue_spin_lock_slowpath - acquire the queue spinlock
* @lock: Pointer to queue spinlock structure
* @val: Current value of the queue spinlock 32-bit word
@@ -329,8 +355,14 @@ queue:
* go away.
*
* *,x,y -> *,0,0
+ *
+ * this wait loop must use a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because the set_locked() function below
+ * does not imply a full barrier.
+ *
*/
- while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK)
+ while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
/*
@@ -338,15 +370,19 @@ queue:
*
* n,0,0 -> 0,0,1 : lock, uncontended
* *,0,0 -> *,0,1 : lock, contended
+ *
+ * If the queue head is the only one in the queue (lock value == tail),
+ * clear the tail code and grab the lock. Otherwise, we only need
+ * to grab the lock.
*/
for (;;) {
- new = _Q_LOCKED_VAL;
- if (val != tail)
- new |= val;
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
+ if (val != tail) {
+ set_locked(lock);
break;
+ }
+ old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+ if (old == val)
+ goto release; /* No contention */
val = old;
}
@@ -354,12 +390,10 @@ queue:
/*
* contended path; wait for next, release.
*/
- if (new != _Q_LOCKED_VAL) {
- while (!(next = READ_ONCE(node->next)))
- cpu_relax();
+ while (!(next = READ_ONCE(node->next)))
+ cpu_relax();
- arch_mcs_spin_unlock_contended(&next->locked);
- }
+ arch_mcs_spin_unlock_contended(&next->locked);
release:
/*
--
1.7.1
From: Peter Zijlstra (Intel) <[email protected]>
When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock.h | 14 ++++++++++++++
include/asm-generic/qspinlock.h | 7 +++++++
kernel/locking/qspinlock.c | 3 +++
3 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 222995b..64c925e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
#ifndef _ASM_X86_QSPINLOCK_H
#define _ASM_X86_QSPINLOCK_H
+#include <asm/cpufeature.h>
#include <asm-generic/qspinlock_types.h>
#define queue_spin_unlock queue_spin_unlock
@@ -15,6 +16,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
smp_store_release((u8 *)lock, 0);
}
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+ if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+ return false;
+
+ while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+ cpu_relax();
+
+ return true;
+}
+
#include <asm-generic/qspinlock.h>
#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 315d6dc..bcbbc5e 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wait(struct qspinlock *lock)
cpu_relax();
}
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+ return false;
+}
+#endif
+
/*
* Initializier
*/
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 99503ef..fc2e5ab 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -249,6 +249,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+ if (virt_queue_spin_lock(lock))
+ return;
+
/*
* wait for in-progress pending->locked hand-overs
*
--
1.7.1
This patch is based on the code sent out by Peter Zijstra as part
of his queue spinlock patch to provide a hashing function with open
addressing. The lfsr() function can be used to return a sequence of
numbers that cycle through all the bit patterns (2^n -1) of a given
bit width n except the value 0 in a somewhat random fashion depending
on the LFSR taps that is being used. Callers can provide their own
taps value or use the default.
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/lfsr.h | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 80 insertions(+), 0 deletions(-)
create mode 100644 include/linux/lfsr.h
diff --git a/include/linux/lfsr.h b/include/linux/lfsr.h
new file mode 100644
index 0000000..f570819
--- /dev/null
+++ b/include/linux/lfsr.h
@@ -0,0 +1,80 @@
+#ifndef _LINUX_LFSR_H
+#define _LINUX_LFSR_H
+
+/*
+ * Simple Binary Galois Linear Feedback Shift Register
+ *
+ * http://en.wikipedia.org/wiki/Linear_feedback_shift_register
+ *
+ * This function only currently supports only bits values of 4-30. Callers
+ * that doesn't pass in a constant bits value can optionally define
+ * LFSR_MIN_BITS and LFSR_MAX_BITS before including the lfsr.h header file
+ * to reduce the size of the jump table in the compiled code, if desired.
+ */
+#ifndef LFSR_MIN_BITS
+#define LFSR_MIN_BITS 4
+#endif
+
+#ifndef LFSR_MAX_BITS
+#define LFSR_MAX_BITS 30
+#endif
+
+static __always_inline u32 lfsr_taps(int bits)
+{
+ BUG_ON((bits < LFSR_MIN_BITS) || (bits > LFSR_MAX_BITS));
+ BUILD_BUG_ON((LFSR_MIN_BITS < 4) || (LFSR_MAX_BITS > 30));
+
+#define _IF_BITS_EQ(x) \
+ if (((x) >= LFSR_MIN_BITS) && ((x) <= LFSR_MAX_BITS) && ((x) == bits))
+
+ /*
+ * Feedback terms copied from
+ * http://users.ece.cmu.edu/~koopman/lfsr/index.html
+ */
+ _IF_BITS_EQ(4) return 0x0009;
+ _IF_BITS_EQ(5) return 0x0012;
+ _IF_BITS_EQ(6) return 0x0021;
+ _IF_BITS_EQ(7) return 0x0041;
+ _IF_BITS_EQ(8) return 0x008E;
+ _IF_BITS_EQ(9) return 0x0108;
+ _IF_BITS_EQ(10) return 0x0204;
+ _IF_BITS_EQ(11) return 0x0402;
+ _IF_BITS_EQ(12) return 0x0829;
+ _IF_BITS_EQ(13) return 0x100D;
+ _IF_BITS_EQ(14) return 0x2015;
+ _IF_BITS_EQ(15) return 0x4122;
+ _IF_BITS_EQ(16) return 0x8112;
+ _IF_BITS_EQ(17) return 0x102C9;
+ _IF_BITS_EQ(18) return 0x20195;
+ _IF_BITS_EQ(19) return 0x403FE;
+ _IF_BITS_EQ(20) return 0x80637;
+ _IF_BITS_EQ(21) return 0x100478;
+ _IF_BITS_EQ(22) return 0x20069E;
+ _IF_BITS_EQ(23) return 0x4004B2;
+ _IF_BITS_EQ(24) return 0x800B87;
+ _IF_BITS_EQ(25) return 0x10004F3;
+ _IF_BITS_EQ(26) return 0x200072D;
+ _IF_BITS_EQ(27) return 0x40006AE;
+ _IF_BITS_EQ(28) return 0x80009E3;
+ _IF_BITS_EQ(29) return 0x10000583;
+ _IF_BITS_EQ(30) return 0x20000C92;
+#undef _IF_BITS_EQ
+
+ /* Unreachable */
+ return 0;
+}
+
+/*
+ * Please note that LFSR doesn't work with a start state of 0.
+ */
+static inline u32 lfsr(u32 val, int bits, u32 taps)
+{
+ u32 bit = val & 1;
+
+ val >>= 1;
+ if (bit)
+ val ^= taps ? taps : lfsr_taps(bits);
+ return val;
+}
+
+#endif /* _LINUX_LFSR_H */
--
1.7.1
Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.
The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.
The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.
The head contender has no such storage directly visible to the
unlocker. So the unlocker searches a hash table with open addressing
using a simple binary Galois linear feedback shift register.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 69 ++++++++-
kernel/locking/qspinlock_paravirt.h | 321 +++++++++++++++++++++++++++++++++++
2 files changed, 389 insertions(+), 1 deletions(-)
create mode 100644 kernel/locking/qspinlock_paravirt.h
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc2e5ab..33b3f54 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -18,6 +18,9 @@
* Authors: Waiman Long <[email protected]>
* Peter Zijlstra <[email protected]>
*/
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
#include <linux/smp.h>
#include <linux/bug.h>
#include <linux/cpumask.h>
@@ -65,13 +68,21 @@
#include "mcs_spinlock.h"
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define MAX_NODES 8
+#else
+#define MAX_NODES 4
+#endif
+
/*
* Per-CPU queue node structures; we can never have more than 4 nested
* contexts: task, softirq, hardirq, nmi.
*
* Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
*/
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
/*
* We must be able to distinguish between no-tail and the tail at 0:0,
@@ -220,6 +231,33 @@ static __always_inline void set_locked(struct qspinlock *lock)
WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
}
+
+/*
+ * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
+
+static __always_inline void __pv_wait_head(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
+
+#define pv_enabled() false
+
+#define pv_init_node __pv_init_node
+#define pv_wait_node __pv_wait_node
+#define pv_kick_node __pv_kick_node
+
+#define pv_wait_head __pv_wait_head
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queue_spin_lock_slowpath native_queue_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
/**
* queue_spin_lock_slowpath - acquire the queue spinlock
* @lock: Pointer to queue spinlock structure
@@ -249,6 +287,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+ if (pv_enabled())
+ goto queue;
+
if (virt_queue_spin_lock(lock))
return;
@@ -325,6 +366,7 @@ queue:
node += idx;
node->locked = 0;
node->next = NULL;
+ pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -350,6 +392,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
+ pv_wait_node(node);
arch_mcs_spin_lock_contended(&node->locked);
}
@@ -365,6 +408,7 @@ queue:
* does not imply a full barrier.
*
*/
+ pv_wait_head(lock, node);
while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
@@ -397,6 +441,7 @@ queue:
cpu_relax();
arch_mcs_spin_unlock_contended(&next->locked);
+ pv_kick_node(next);
release:
/*
@@ -405,3 +450,25 @@ release:
this_cpu_dec(mcs_nodes[0].count);
}
EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for queue_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef pv_enabled
+#define pv_enabled() true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef pv_kick_node
+#undef pv_wait_head
+
+#undef queue_spin_lock_slowpath
+#define queue_spin_lock_slowpath __pv_queue_spin_lock_slowpath
+
+#include "qspinlock_paravirt.h"
+#include "qspinlock.c"
+
+#endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
new file mode 100644
index 0000000..49dbd39
--- /dev/null
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -0,0 +1,321 @@
+#ifndef _GEN_PV_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+/*
+ * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
+ * of spinning them.
+ *
+ * This relies on the architecture to provide two paravirt hypercalls:
+ *
+ * pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
+ * pv_kick(cpu) -- wakes a suspended vcpu
+ *
+ * Using these we implement __pv_queue_spin_lock_slowpath() and
+ * __pv_queue_spin_unlock() to replace native_queue_spin_lock_slowpath() and
+ * native_queue_spin_unlock().
+ */
+
+#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+
+enum vcpu_state {
+ vcpu_running = 0,
+ vcpu_halted,
+};
+
+struct pv_node {
+ struct mcs_spinlock mcs;
+ struct mcs_spinlock __res[3];
+
+ int cpu;
+ u8 state;
+};
+
+/*
+ * Hash table using open addressing with an LFSR probe sequence.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ * Instead of probing just the immediate bucket we probe all buckets in the
+ * same cacheline.
+ *
+ * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
+ *
+ * Dynamically allocate a hash table big enough to hold at least 4X the
+ * number of possible cpus in the system. Allocation is done on page
+ * granularity. So the minimum number of hash buckets should be at least
+ * 256 to fully utilize a 4k page.
+ */
+#define LFSR_MIN_BITS 8
+#define LFSR_MAX_BITS (2 + NR_CPUS_BITS)
+#if LFSR_MAX_BITS < LFSR_MIN_BITS
+#undef LFSR_MAX_BITS
+#define LFSR_MAX_BITS LFSR_MIN_BITS
+#endif
+
+struct pv_hash_bucket {
+ struct qspinlock *lock;
+ struct pv_node *node;
+};
+#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
+#define HB_RESERVED ((struct qspinlock *)1)
+
+static struct pv_hash_bucket *pv_lock_hash;
+static unsigned int pv_lock_hash_bits __read_mostly;
+
+#include <linux/hash.h>
+#include <linux/lfsr.h>
+#include <linux/bootmem.h>
+
+/*
+ * Allocate memory for the PV qspinlock hash buckets
+ *
+ * This function should be called from the paravirt spinlock initialization
+ * routine.
+ */
+void __init __pv_init_lock_hash(void)
+{
+ int pv_hash_size = 4 * num_possible_cpus();
+
+ if (pv_hash_size < (1U << LFSR_MIN_BITS))
+ pv_hash_size = (1U << LFSR_MIN_BITS);
+ /*
+ * Allocate space from bootmem which should be page-size aligned
+ * and hence cacheline aligned.
+ */
+ pv_lock_hash = alloc_large_system_hash("PV qspinlock",
+ sizeof(struct pv_hash_bucket),
+ pv_hash_size, 0, HASH_EARLY,
+ &pv_lock_hash_bits, NULL,
+ pv_hash_size, pv_hash_size);
+}
+
+static inline u32 hash_align(u32 hash)
+{
+ return hash & ~(PV_HB_PER_LINE - 1);
+}
+
+static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
+{
+ unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
+ struct pv_hash_bucket *hb, *end;
+
+ if (!hash)
+ hash = 1;
+
+ init_hash = hash;
+ hb = &pv_lock_hash[hash_align(hash)];
+ for (;;) {
+ for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
+ if (!cmpxchg(&hb->lock, NULL, lock)) {
+ WRITE_ONCE(hb->node, node);
+ /*
+ * We haven't set the _Q_SLOW_VAL yet. So
+ * the order of writing doesn't matter.
+ */
+ smp_wmb(); /* matches rmb from pv_hash_find */
+ goto done;
+ }
+ }
+
+ hash = lfsr(hash, pv_lock_hash_bits, 0);
+ hb = &pv_lock_hash[hash_align(hash)];
+ BUG_ON(hash == init_hash);
+ }
+
+done:
+ return &hb->lock;
+}
+
+static struct pv_node *pv_hash_find(struct qspinlock *lock)
+{
+ unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
+ struct pv_hash_bucket *hb, *end;
+ struct pv_node *node = NULL;
+
+ if (!hash)
+ hash = 1;
+
+ init_hash = hash;
+ hb = &pv_lock_hash[hash_align(hash)];
+ for (;;) {
+ for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
+ struct qspinlock *l = READ_ONCE(hb->lock);
+
+ if (l == lock) {
+ smp_rmb(); /* matches wmb from pv_hash() */
+ node = READ_ONCE(hb->node);
+ goto done;
+ }
+ }
+
+ hash = lfsr(hash, pv_lock_hash_bits, 0);
+ hb = &pv_lock_hash[hash_align(hash)];
+ BUG_ON(hash == init_hash);
+ }
+done:
+ /*
+ * Clear the hash bucket
+ */
+ WRITE_ONCE(hb->lock, NULL);
+ return node;
+}
+
+/*
+ * Initialize the PV part of the mcs_spinlock node.
+ */
+static void pv_init_node(struct mcs_spinlock *node)
+{
+ struct pv_node *pn = (struct pv_node *)node;
+
+ BUILD_BUG_ON(sizeof(struct pv_node) > 5*sizeof(struct mcs_spinlock));
+
+ pn->cpu = smp_processor_id();
+ pn->state = vcpu_running;
+}
+
+/*
+ * Wait for node->locked to become true, halt the vcpu after a short spin.
+ * pv_kick_node() is used to wake the vcpu again.
+ */
+static void pv_wait_node(struct mcs_spinlock *node)
+{
+ struct pv_node *pn = (struct pv_node *)node;
+ int loop;
+
+ for (;;) {
+ for (loop = SPIN_THRESHOLD; loop; loop--) {
+ if (READ_ONCE(node->locked))
+ return;
+
+ cpu_relax();
+ }
+
+ /*
+ * Order pn->state vs pn->locked thusly:
+ *
+ * [S] pn->state = vcpu_halted [S] next->locked = 1
+ * MB MB
+ * [L] pn->locked [RmW] pn->state = vcpu_running
+ *
+ * Matches the xchg() from pv_kick_node().
+ */
+ (void)xchg(&pn->state, vcpu_halted);
+
+ if (!READ_ONCE(node->locked))
+ pv_wait(&pn->state, vcpu_halted);
+
+ /* Make sure that state is correct for spurious wakeup */
+ WRITE_ONCE(pn->state, vcpu_running);
+ }
+
+ /*
+ * By now our node->locked should be 1 and our caller will not actually
+ * spin-wait for it. We do however rely on our caller to do a
+ * load-acquire for us.
+ */
+}
+
+/*
+ * Called after setting next->locked = 1, used to wake those stuck in
+ * pv_wait_node().
+ */
+static void pv_kick_node(struct mcs_spinlock *node)
+{
+ struct pv_node *pn = (struct pv_node *)node;
+
+ /*
+ * Note that because node->locked is already set, this actual
+ * mcs_spinlock entry could be re-used already.
+ *
+ * This should be fine however, kicking people for no reason is
+ * harmless.
+ *
+ * See the comment in pv_wait_node().
+ */
+ if (xchg(&pn->state, vcpu_running) == vcpu_halted)
+ pv_kick(pn->cpu);
+}
+
+/*
+ * Wait for l->locked to become clear; halt the vcpu after a short spin.
+ * __pv_queue_spin_unlock() will wake us.
+ */
+static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+{
+ struct __qspinlock *l = (void *)lock;
+ struct qspinlock **lp = NULL;
+ struct pv_node *pn = (struct pv_node *)node;
+ int slow_set = false;
+ int loop;
+
+ for (;;) {
+ for (loop = SPIN_THRESHOLD; loop; loop--) {
+ if (!READ_ONCE(l->locked))
+ return;
+
+ cpu_relax();
+ }
+
+ WRITE_ONCE(pn->state, vcpu_halted);
+ if (!lp)
+ lp = pv_hash(lock, pn);
+ /*
+ * lp must be set before setting _Q_SLOW_VAL
+ *
+ * [S] lp = lock [RmW] l = l->locked = 0
+ * MB MB
+ * [S] l->locked = _Q_SLOW_VAL [L] lp
+ *
+ * Matches the cmpxchg() in pv_queue_spin_unlock().
+ */
+ if (!slow_set &&
+ !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+ /*
+ * The lock is free and _Q_SLOW_VAL has never been
+ * set. Need to clear the hash bucket before getting
+ * the lock.
+ */
+ WRITE_ONCE(*lp, NULL);
+ return;
+ } else if (slow_set && !READ_ONCE(l->locked))
+ return;
+ slow_set = true;
+
+ pv_wait(&l->locked, _Q_SLOW_VAL);
+ }
+ /*
+ * Lock is unlocked now; the caller will acquire it without waiting.
+ * As with pv_wait_node() we rely on the caller to do a load-acquire
+ * for us.
+ */
+}
+
+/*
+ * To be used in stead of queue_spin_unlock() for paravirt locks. Wakes
+ * pv_wait_head() if appropriate.
+ */
+__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+ struct pv_node *node;
+
+ if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
+ return;
+
+ /*
+ * The queue head has been halted. Need to locate it and wake it up.
+ */
+ node = pv_hash_find(lock);
+ smp_store_release(&l->locked, 0);
+
+ /*
+ * At this point the memory pointed at by lock can be freed/reused,
+ * however we can still use the PV node to kick the CPU.
+ */
+ if (READ_ONCE(node->state) == vcpu_halted)
+ pv_kick(node->cpu);
+}
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
--
1.7.1
From: Peter Zijlstra (Intel) <[email protected]>
We use the regular paravirt call patching to switch between:
native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath()
native_queue_spin_unlock() __pv_queue_spin_unlock()
We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.
We further optimize the unlock path by patching the direct call with a
"movb $0,%arg1" if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.
This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/paravirt.h | 28 +++++++++++++++++++++++++++-
arch/x86/include/asm/paravirt_types.h | 10 ++++++++++
arch/x86/include/asm/qspinlock.h | 25 ++++++++++++++++++++++++-
arch/x86/kernel/paravirt-spinlocks.c | 24 +++++++++++++++++++++++-
arch/x86/kernel/paravirt_patch_32.c | 22 ++++++++++++++++++----
arch/x86/kernel/paravirt_patch_64.c | 22 ++++++++++++++++++----
7 files changed, 121 insertions(+), 12 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 49fecb1..a0946e7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -661,7 +661,7 @@ config PARAVIRT_DEBUG
config PARAVIRT_SPINLOCKS
bool "Paravirtualization layer for spinlocks"
depends on PARAVIRT && SMP
- select UNINLINE_SPIN_UNLOCK
+ select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK
---help---
Paravirtualized spinlocks allow a pvops backend to replace the
spinlock implementation with something virtualization-friendly
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 965c47d..dd40269 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,30 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val);
+}
+
+static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock)
+{
+ PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock);
+}
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+ PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+ PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
{
@@ -724,7 +748,9 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
}
-#endif
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
+#endif /* SMP && PARAVIRT_SPINLOCKS */
#ifdef CONFIG_X86_32
#define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..f6acaea 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,19 @@ struct arch_spinlock;
typedef u16 __ticket_t;
#endif
+struct qspinlock;
+
struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+ void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+ struct paravirt_callee_save queue_spin_unlock;
+
+ void (*wait)(u8 *ptr, u8 val);
+ void (*kick)(int cpu);
+#else /* !CONFIG_QUEUE_SPINLOCK */
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif /* !CONFIG_QUEUE_SPINLOCK */
};
/* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 64c925e..c8290db 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -3,6 +3,7 @@
#include <asm/cpufeature.h>
#include <asm-generic/qspinlock_types.h>
+#include <asm/paravirt.h>
#define queue_spin_unlock queue_spin_unlock
/**
@@ -11,11 +12,33 @@
*
* A smp_store_release() on the least-significant byte.
*/
-static inline void queue_spin_unlock(struct qspinlock *lock)
+static inline void native_queue_spin_unlock(struct qspinlock *lock)
{
smp_store_release((u8 *)lock, 0);
}
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __raw_callee_save___pv_queue_spin_unlock(struct qspinlock *lock);
+
+static inline void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ pv_queue_spin_lock_slowpath(lock, val);
+}
+
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ pv_queue_spin_unlock(lock);
+}
+#else
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ native_queue_spin_unlock(lock);
+}
+#endif
+
#define virt_queue_spin_lock virt_queue_spin_lock
static inline bool virt_queue_spin_lock(struct qspinlock *lock)
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..ff134b4 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -8,11 +8,33 @@
#include <asm/paravirt.h>
+#ifdef CONFIG_QUEUE_SPINLOCK
+__visible void __native_queue_spin_unlock(struct qspinlock *lock)
+{
+ native_queue_spin_unlock(lock);
+}
+
+PV_CALLEE_SAVE_REGS_THUNK(__native_queue_spin_unlock);
+
+bool pv_is_native_spin_unlock(void)
+{
+ return pv_lock_ops.queue_spin_unlock.func ==
+ __raw_callee_save___native_queue_spin_unlock;
+}
+#endif
+
struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP
+#ifdef CONFIG_QUEUE_SPINLOCK
+ .queue_spin_lock_slowpath = native_queue_spin_lock_slowpath,
+ .queue_spin_unlock = PV_CALLEE_SAVE(__native_queue_spin_unlock),
+ .wait = paravirt_nop,
+ .kick = paravirt_nop,
+#else /* !CONFIG_QUEUE_SPINLOCK */
.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
.unlock_kick = paravirt_nop,
-#endif
+#endif /* !CONFIG_QUEUE_SPINLOCK */
+#endif /* SMP */
};
EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
index d9f32e6..5462727 100644
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -12,6 +12,10 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
DEF_NATIVE(pv_cpu_ops, clts, "clts");
DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCKS)
+DEF_NATIVE(pv_lock_ops, queue_spin_unlock, "movb $0, (%eax)");
+#endif
+
unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
{
/* arg in %eax, return in %eax */
@@ -24,6 +28,8 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
return 0;
}
+extern bool pv_is_native_spin_unlock(void);
+
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
{
@@ -47,14 +53,22 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
PATCH_SITE(pv_mmu_ops, write_cr3);
PATCH_SITE(pv_cpu_ops, clts);
PATCH_SITE(pv_cpu_ops, read_tsc);
-
- patch_site:
- ret = paravirt_patch_insns(ibuf, len, start, end);
- break;
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCKS)
+ case PARAVIRT_PATCH(pv_lock_ops.queue_spin_unlock):
+ if (pv_is_native_spin_unlock()) {
+ start = start_pv_lock_ops_queue_spin_unlock;
+ end = end_pv_lock_ops_queue_spin_unlock;
+ goto patch_site;
+ }
+#endif
default:
ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
break;
+
+ patch_site:
+ ret = paravirt_patch_insns(ibuf, len, start, end);
+ break;
}
#undef PATCH_SITE
return ret;
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index a1da673..6936344 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -21,6 +21,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
DEF_NATIVE(, mov32, "mov %edi, %eax");
DEF_NATIVE(, mov64, "mov %rdi, %rax");
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+DEF_NATIVE(pv_lock_ops, queue_spin_unlock, "movb $0, (%rdi)");
+#endif
+
unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
{
return paravirt_patch_insns(insnbuf, len,
@@ -33,6 +37,8 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
start__mov64, end__mov64);
}
+extern bool pv_is_native_spin_unlock(void);
+
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
{
@@ -59,14 +65,22 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
PATCH_SITE(pv_cpu_ops, clts);
PATCH_SITE(pv_mmu_ops, flush_tlb_single);
PATCH_SITE(pv_cpu_ops, wbinvd);
-
- patch_site:
- ret = paravirt_patch_insns(ibuf, len, start, end);
- break;
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+ case PARAVIRT_PATCH(pv_lock_ops.queue_spin_unlock):
+ if (pv_is_native_spin_unlock()) {
+ start = start_pv_lock_ops_queue_spin_unlock;
+ end = end_pv_lock_ops_queue_spin_unlock;
+ goto patch_site;
+ }
+#endif
default:
ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
break;
+
+ patch_site:
+ ret = paravirt_patch_insns(ibuf, len, start, end);
+ break;
}
#undef PATCH_SITE
return ret;
--
1.7.1
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/kernel/kvm.c | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 2 +-
2 files changed, 44 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e354cc6..4bb42c0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -584,6 +584,39 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include <asm/qspinlock.h>
+
+static void kvm_wait(u8 *ptr, u8 val)
+{
+ unsigned long flags;
+
+ if (in_nmi())
+ return;
+
+ local_irq_save(flags);
+
+ if (READ_ONCE(*ptr) != val)
+ goto out;
+
+ /*
+ * halt until it's our turn and kicked. Note that we do safe halt
+ * for irq enabled case to avoid hang when lock info is overwritten
+ * in irq spinlock slowpath and no spurious interrupt occur to save us.
+ */
+ if (arch_irqs_disabled_flags(flags))
+ halt();
+ else
+ safe_halt();
+
+out:
+ local_irq_restore(flags);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -817,6 +850,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
}
}
+#endif /* !CONFIG_QUEUE_SPINLOCK */
+
/*
* Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
*/
@@ -828,8 +863,16 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
+#ifdef CONFIG_QUEUE_SPINLOCK
+ __pv_init_lock_hash();
+ pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+ pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+ pv_lock_ops.wait = kvm_wait;
+ pv_lock_ops.kick = kvm_kick_cpu;
+#else /* !CONFIG_QUEUE_SPINLOCK */
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
}
static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index c6a8f7c..537b13e 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
- depends on SMP && !PARAVIRT_SPINLOCKS
+ depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
config ARCH_USE_QUEUE_RWLOCK
bool
--
1.7.1
This patch adds the necessary Xen specific code to allow Xen to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/xen/spinlock.c | 63 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/Kconfig.locks | 2 +-
2 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 956374c..728b45b 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,55 @@
#include "xen-ops.h"
#include "debugfs.h"
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include <asm/qspinlock.h>
+
+static void xen_qlock_kick(int cpu)
+{
+ xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+static void xen_qlock_wait(u8 *byte, u8 val)
+{
+ int irq = __this_cpu_read(lock_kicker_irq);
+
+ /* If kicker interrupts not initialized yet, just spin */
+ if (irq == -1)
+ return;
+
+ /* clear pending */
+ xen_clear_irq_pending(irq);
+
+ /*
+ * We check the byte value after clearing pending IRQ to make sure
+ * that we won't miss a wakeup event because of the clearing.
+ *
+ * The sync_clear_bit() call in xen_clear_irq_pending() is atomic.
+ * So it is effectively a memory barrier for x86.
+ */
+ if (READ_ONCE(*byte) != val)
+ return;
+
+ /*
+ * If an interrupt happens here, it will leave the wakeup irq
+ * pending, which will cause xen_poll_irq() to return
+ * immediately.
+ */
+
+ /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+ xen_poll_irq(irq);
+}
+
+#else /* CONFIG_QUEUE_SPINLOCK */
+
enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +149,9 @@ struct xen_lock_waiting {
__ticket_t want;
};
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
static cpumask_t waiting_cpus;
-static bool xen_pvspin = true;
__visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
{
int irq = __this_cpu_read(lock_kicker_irq);
@@ -217,6 +263,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
}
}
}
+#endif /* CONFIG_QUEUE_SPINLOCK */
static irqreturn_t dummy_handler(int irq, void *dev_id)
{
@@ -280,8 +327,16 @@ void __init xen_init_spinlocks(void)
return;
}
printk(KERN_DEBUG "xen: PV spinlocks enabled\n");
+#ifdef CONFIG_QUEUE_SPINLOCK
+ __pv_init_lock_hash();
+ pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+ pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+ pv_lock_ops.wait = xen_qlock_wait;
+ pv_lock_ops.kick = xen_qlock_kick;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
+#endif
}
/*
@@ -310,7 +365,7 @@ static __init int xen_parse_nopvspin(char *arg)
}
early_param("xen_nopvspin", xen_parse_nopvspin);
-#ifdef CONFIG_XEN_DEBUG_FS
+#if defined(CONFIG_XEN_DEBUG_FS) && !defined(CONFIG_QUEUE_SPINLOCK)
static struct dentry *d_spin_debug;
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 537b13e..0b42933 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
- depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
+ depends on SMP
config ARCH_USE_QUEUE_RWLOCK
bool
--
1.7.1
Before this patch, a CPU may have been kicked twice before getting
the lock - one before it becomes queue head and once before it gets
the lock. All these CPU kicking and halting (VMEXIT) can be expensive
and slow down system performance, especially in an overcommitted guest.
This patch add a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until at unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 10 ++--
kernel/locking/qspinlock_paravirt.h | 76 +++++++++++++++++++++++++----------
2 files changed, 59 insertions(+), 27 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 33b3f54..b9ba83b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_scan_next(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }
@@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
-#define pv_kick_node __pv_kick_node
+#define pv_scan_next __pv_scan_next
#define pv_wait_head __pv_wait_head
@@ -441,7 +441,7 @@ queue:
cpu_relax();
arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_scan_next(lock, next);
release:
/*
@@ -462,7 +462,7 @@ EXPORT_SYMBOL(queue_spin_lock_slowpath);
#undef pv_init_node
#undef pv_wait_node
-#undef pv_kick_node
+#undef pv_scan_next
#undef pv_wait_head
#undef queue_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 49dbd39..a210061 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -18,9 +18,16 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+/*
+ * The vcpu_hashed is a special state that is set by the new lock holder on
+ * the new queue head to indicate that _Q_SLOW_VAL is set and hash entry
+ * filled. With this state, the queue head CPU will always be kicked even
+ * if it is not halted to avoid potential racing condition.
+ */
enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
+ vcpu_hashed
};
struct pv_node {
@@ -97,7 +104,13 @@ static inline u32 hash_align(u32 hash)
return hash & ~(PV_HB_PER_LINE - 1);
}
-static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
+/*
+ * Set up an entry in the lock hash table
+ * This is not inlined to reduce size of generated code as it is included
+ * twice and is used only in the slowest path of handling CPU halting.
+ */
+static noinline struct qspinlock **
+pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_bucket *hb, *end;
@@ -178,7 +191,8 @@ static void pv_init_node(struct mcs_spinlock *node)
/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
*/
static void pv_wait_node(struct mcs_spinlock *node)
{
@@ -189,7 +203,6 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
-
cpu_relax();
}
@@ -198,17 +211,21 @@ static void pv_wait_node(struct mcs_spinlock *node)
*
* [S] pn->state = vcpu_halted [S] next->locked = 1
* MB MB
- * [L] pn->locked [RmW] pn->state = vcpu_running
+ * [L] pn->locked [RmW] pn->state = vcpu_hashed
*
- * Matches the xchg() from pv_kick_node().
+ * Matches the cmpxchg() from pv_scan_next().
*/
(void)xchg(&pn->state, vcpu_halted);
if (!READ_ONCE(node->locked))
pv_wait(&pn->state, vcpu_halted);
- /* Make sure that state is correct for spurious wakeup */
- WRITE_ONCE(pn->state, vcpu_running);
+ /*
+ * Reset the state except when vcpu_hashed is set. At this
+ * state, node->locked should have been set already and it
+ * needs to move on to pv_wait_head().
+ */
+ (void)cmpxchg(&pn->state, vcpu_halted, vcpu_running);
}
/*
@@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
}
/*
- * Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * Called after setting next->locked = 1 & lock acquired.
+ * Check if the the CPU has been halted. If so, set the _Q_SLOW_VAL flag
+ * and put an entry into the lock hash table to be waken up at unlock time.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct __qspinlock *l = (void *)lock;
/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
- *
- * This should be fine however, kicking people for no reason is
- * harmless.
- *
- * See the comment in pv_wait_node().
+ * Transition CPU state: halted => hashed
+ * Quit if the transition failed.
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
- pv_kick(pn->cpu);
+ if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ return;
+
+ /*
+ * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
+ * As this is the same CPU that will check the _Q_SLOW_VAL value and
+ * the hash table later on at unlock time, no atomic instruction is
+ * needed.
+ */
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
}
/*
@@ -259,7 +282,16 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}
- WRITE_ONCE(pn->state, vcpu_halted);
+ /*
+ * Go directly to pv_wait() if it has already been in the
+ * hashed state - _Q_SLOW_VAL set & hash table filled.
+ * This is to eliminate possible race condition of not
+ * properly clearing the hash table entry.
+ */
+ if (cmpxchg(&pn->state, vcpu_running, vcpu_halted)
+ == vcpu_hashed)
+ goto wait_now;
+
if (!lp)
lp = pv_hash(lock, pn);
/*
@@ -283,7 +315,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
} else if (slow_set && !READ_ONCE(l->locked))
return;
slow_set = true;
-
+wait_now:
pv_wait(&l->locked, _Q_SLOW_VAL);
}
/*
@@ -315,7 +347,7 @@ __visible void __pv_queue_spin_unlock(struct qspinlock *lock)
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the PV node to kick the CPU.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
+ if (READ_ONCE(node->state) != vcpu_running)
pv_kick(node->cpu);
}
PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
--
1.7.1
In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.
This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 28 +++++++++++++++++++++++++---
1 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index a210061..a9fe10d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -16,7 +16,8 @@
* native_queue_spin_unlock().
*/
-#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+#define MAYHALT_THRESHOLD (SPIN_THRESHOLD >> 4)
/*
* The vcpu_hashed is a special state that is set by the new lock holder on
@@ -36,6 +37,7 @@ struct pv_node {
int cpu;
u8 state;
+ u8 mayhalt;
};
/*
@@ -187,6 +189,7 @@ static void pv_init_node(struct mcs_spinlock *node)
pn->cpu = smp_processor_id();
pn->state = vcpu_running;
+ pn->mayhalt = false;
}
/*
@@ -203,17 +206,27 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
+ if (loop == MAYHALT_THRESHOLD)
+ xchg(&pn->mayhalt, true);
cpu_relax();
}
/*
- * Order pn->state vs pn->locked thusly:
+ * Order pn->state/pn->mayhalt vs pn->locked thusly:
*
- * [S] pn->state = vcpu_halted [S] next->locked = 1
+ * [S] pn->mayhalt = 1 [S] next->locked = 1
+ * MB, delay barrier()
+ * [S] pn->state = vcpu_halted [L] pn->mayhalt
* MB MB
* [L] pn->locked [RmW] pn->state = vcpu_hashed
*
* Matches the cmpxchg() from pv_scan_next().
+ *
+ * As the new lock holder may quit (when pn->mayhalt is not
+ * set) without memory barrier, a sufficiently long delay is
+ * inserted between the setting of pn->mayhalt and pn->state
+ * to ensure that there is enough time for the new pn->locked
+ * value to be propagated here to be checked below.
*/
(void)xchg(&pn->state, vcpu_halted);
@@ -226,6 +239,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* needs to move on to pv_wait_head().
*/
(void)cmpxchg(&pn->state, vcpu_halted, vcpu_running);
+ pn->mayhalt = false;
}
/*
@@ -246,6 +260,14 @@ static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
struct __qspinlock *l = (void *)lock;
/*
+ * If mayhalt is not set, there is enough time for the just set value
+ * in pn->locked to be propagated to the other CPU before it is time
+ * to halt.
+ */
+ if (!READ_ONCE(pn->mayhalt))
+ return;
+
+ /*
* Transition CPU state: halted => hashed
* Quit if the transition failed.
*/
--
1.7.1
The current code for PV lock hash table processing will panic the
system if pv_hash_find() can't find the desired hash bucket. However,
there is no check to see if there is more than one entry for a given
lock which should never happen.
This patch adds a pv_hash_check_duplicate() function to do that which
will only be enabled if CONFIG_DEBUG_SPINLOCK is defined because of
the performance overhead it introduces.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 58 +++++++++++++++++++++++++++++++++++
1 files changed, 58 insertions(+), 0 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index a9fe10d..4d39c8b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -107,6 +107,63 @@ static inline u32 hash_align(u32 hash)
}
/*
+ * Hash table debugging code
+ */
+#ifdef CONFIG_DEBUG_SPINLOCK
+
+#define _NODE_IDX(pn) ((((unsigned long)pn) & (SMP_CACHE_BYTES - 1)) /\
+ sizeof(struct mcs_spinlock))
+/*
+ * Check if there is additional hash buckets with the same lock which
+ * should not happen.
+ */
+static inline void pv_hash_check_duplicate(struct qspinlock *lock)
+{
+ struct pv_hash_bucket *hb, *end, *hb1 = NULL;
+ int count = 0, used = 0;
+
+ end = &pv_lock_hash[1 << pv_lock_hash_bits];
+ for (hb = pv_lock_hash; hb < end; hb++) {
+ struct qspinlock *l = READ_ONCE(hb->lock);
+ struct pv_node *pn;
+
+ if (l)
+ used++;
+ if (l != lock)
+ continue;
+ if (++count == 1) {
+ hb1 = hb;
+ continue;
+ }
+ WARN_ON(count == 2);
+ if (hb1) {
+ pn = READ_ONCE(hb1->node);
+ printk(KERN_ERR "PV lock hash error: duplicated entry "
+ "#%d - hash %ld, node %ld, cpu %d\n", 1,
+ hb1 - pv_lock_hash, _NODE_IDX(pn),
+ pn ? pn->cpu : -1);
+ hb1 = NULL;
+ }
+ pn = READ_ONCE(hb->node);
+ printk(KERN_ERR "PV lock hash error: duplicated entry #%d - "
+ "hash %ld, node %ld, cpu %d\n", count, hb - pv_lock_hash,
+ _NODE_IDX(pn), pn ? pn->cpu : -1);
+ }
+ /*
+ * Warn if more than half of the buckets are used
+ */
+ if (used > (1 << (pv_lock_hash_bits - 1)))
+ printk(KERN_WARNING "PV lock hash warning: "
+ "%d hash entries used!\n", used);
+}
+
+#else /* CONFIG_DEBUG_SPINLOCK */
+
+static inline void pv_hash_check_duplicate(struct qspinlock *lock) {}
+
+#endif /* CONFIG_DEBUG_SPINLOCK */
+
+/*
* Set up an entry in the lock hash table
* This is not inlined to reduce size of generated code as it is included
* twice and is used only in the slowest path of handling CPU halting.
@@ -141,6 +198,7 @@ pv_hash(struct qspinlock *lock, struct pv_node *node)
}
done:
+ pv_hash_check_duplicate(lock);
return &hb->lock;
}
--
1.7.1
On 07/04/15 03:55, Waiman Long wrote:
> This patch adds the necessary Xen specific code to allow Xen to
> support the CPU halting and kicking operations needed by the queue
> spinlock PV code.
This basically looks the same as the version I wrote, except I think you
broke it.
> +static void xen_qlock_wait(u8 *byte, u8 val)
> +{
> + int irq = __this_cpu_read(lock_kicker_irq);
> +
> + /* If kicker interrupts not initialized yet, just spin */
> + if (irq == -1)
> + return;
> +
> + /* clear pending */
> + xen_clear_irq_pending(irq);
> +
> + /*
> + * We check the byte value after clearing pending IRQ to make sure
> + * that we won't miss a wakeup event because of the clearing.
My version had a barrier() here to ensure this. The documentation of
READ_ONCE() suggests that it is not sufficient to meet this requirement
(and a READ_ONCE() here is not required anyway).
> + *
> + * The sync_clear_bit() call in xen_clear_irq_pending() is atomic.
> + * So it is effectively a memory barrier for x86.
> + */
> + if (READ_ONCE(*byte) != val)
> + return;
> +
> + /*
> + * If an interrupt happens here, it will leave the wakeup irq
> + * pending, which will cause xen_poll_irq() to return
> + * immediately.
> + */
> +
> + /* Block until irq becomes pending (or perhaps a spurious wakeup) */
> + xen_poll_irq(irq);
> +}
David
On 04/08/2015 08:01 AM, David Vrabel wrote:
> On 07/04/15 03:55, Waiman Long wrote:
>> This patch adds the necessary Xen specific code to allow Xen to
>> support the CPU halting and kicking operations needed by the queue
>> spinlock PV code.
> This basically looks the same as the version I wrote, except I think you
> broke it.
>
>> +static void xen_qlock_wait(u8 *byte, u8 val)
>> +{
>> + int irq = __this_cpu_read(lock_kicker_irq);
>> +
>> + /* If kicker interrupts not initialized yet, just spin */
>> + if (irq == -1)
>> + return;
>> +
>> + /* clear pending */
>> + xen_clear_irq_pending(irq);
>> +
>> + /*
>> + * We check the byte value after clearing pending IRQ to make sure
>> + * that we won't miss a wakeup event because of the clearing.
> My version had a barrier() here to ensure this. The documentation of
> READ_ONCE() suggests that it is not sufficient to meet this requirement
> (and a READ_ONCE() here is not required anyway).
I have the misconception that a READ_ONCE/WRITE_ONCE() implies a
compiler barrier. You are right that it may not be the case. I will add
back the missing barrier when I update the patch and add you as the
author of this patch if you don't mind.
Cheers,
Longman
On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -0,0 +1,321 @@
> +#ifndef _GEN_PV_LOCK_SLOWPATH
> +#error "do not include this file"
> +#endif
> +
> +/*
> + * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
> + * of spinning them.
> + *
> + * This relies on the architecture to provide two paravirt hypercalls:
> + *
> + * pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
> + * pv_kick(cpu) -- wakes a suspended vcpu
> + *
> + * Using these we implement __pv_queue_spin_lock_slowpath() and
> + * __pv_queue_spin_unlock() to replace native_queue_spin_lock_slowpath() and
> + * native_queue_spin_unlock().
> + */
> +
> +#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
> +
> +enum vcpu_state {
> + vcpu_running = 0,
> + vcpu_halted,
> +};
> +
> +struct pv_node {
> + struct mcs_spinlock mcs;
> + struct mcs_spinlock __res[3];
> +
> + int cpu;
> + u8 state;
> +};
> +
> +/*
> + * Hash table using open addressing with an LFSR probe sequence.
> + *
> + * Since we should not be holding locks from NMI context (very rare indeed) the
> + * max load factor is 0.75, which is around the point where open addressing
> + * breaks down.
> + *
> + * Instead of probing just the immediate bucket we probe all buckets in the
> + * same cacheline.
> + *
> + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
> + *
> + * Dynamically allocate a hash table big enough to hold at least 4X the
> + * number of possible cpus in the system. Allocation is done on page
> + * granularity. So the minimum number of hash buckets should be at least
> + * 256 to fully utilize a 4k page.
> + */
> +#define LFSR_MIN_BITS 8
> +#define LFSR_MAX_BITS (2 + NR_CPUS_BITS)
> +#if LFSR_MAX_BITS < LFSR_MIN_BITS
> +#undef LFSR_MAX_BITS
> +#define LFSR_MAX_BITS LFSR_MIN_BITS
> +#endif
> +
> +struct pv_hash_bucket {
> + struct qspinlock *lock;
> + struct pv_node *node;
> +};
> +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
> +#define HB_RESERVED ((struct qspinlock *)1)
This is unused.
> +
> +static struct pv_hash_bucket *pv_lock_hash;
> +static unsigned int pv_lock_hash_bits __read_mostly;
static unsigned int pv_taps __read_mostly;
> +
> +#include <linux/hash.h>
> +#include <linux/lfsr.h>
> +#include <linux/bootmem.h>
> +
> +/*
> + * Allocate memory for the PV qspinlock hash buckets
> + *
> + * This function should be called from the paravirt spinlock initialization
> + * routine.
> + */
> +void __init __pv_init_lock_hash(void)
> +{
> + int pv_hash_size = 4 * num_possible_cpus();
> +
> + if (pv_hash_size < (1U << LFSR_MIN_BITS))
> + pv_hash_size = (1U << LFSR_MIN_BITS);
> + /*
> + * Allocate space from bootmem which should be page-size aligned
> + * and hence cacheline aligned.
> + */
> + pv_lock_hash = alloc_large_system_hash("PV qspinlock",
> + sizeof(struct pv_hash_bucket),
> + pv_hash_size, 0, HASH_EARLY,
> + &pv_lock_hash_bits, NULL,
> + pv_hash_size, pv_hash_size);
pv_taps = lfsr_taps(pv_lock_hash_bits);
> +}
> +
> +static inline u32 hash_align(u32 hash)
> +{
> + return hash & ~(PV_HB_PER_LINE - 1);
> +}
> +
> +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> +{
> + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
> + struct pv_hash_bucket *hb, *end;
> +
> + if (!hash)
> + hash = 1;
> +
> + init_hash = hash;
> + hb = &pv_lock_hash[hash_align(hash)];
> + for (;;) {
> + for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
> + if (!cmpxchg(&hb->lock, NULL, lock)) {
> + WRITE_ONCE(hb->node, node);
> + /*
> + * We haven't set the _Q_SLOW_VAL yet. So
> + * the order of writing doesn't matter.
> + */
> + smp_wmb(); /* matches rmb from pv_hash_find */
This doesn't make sense. Both sites do ->lock first and ->node second.
No amount of ordering can 'fix' that.
I think we can safely remove this wmb and the rmb below, because the
required ordering is already provided by setting/observing l->locked ==
SLOW.
> + goto done;
> + }
> + }
> +
> + hash = lfsr(hash, pv_lock_hash_bits, 0);
Since pv_lock_hash_bits is a variable, you end up running through that
massive if() forest to find the corresponding tap every single time. It
cannot compile-time optimize it.
Hence:
hash = lfsr(hash, pv_taps);
(I don't get the bits argument to the lfsr).
In any case, like I said before, I think we should try a linear probe
sequence first, the lfsr was over engineering from my side.
> + hb = &pv_lock_hash[hash_align(hash)];
> + BUG_ON(hash == init_hash);
> + }
> +
> +done:
> + return &hb->lock;
> +}
> +
> +static struct pv_node *pv_hash_find(struct qspinlock *lock)
> +{
> + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
> + struct pv_hash_bucket *hb, *end;
> + struct pv_node *node = NULL;
> +
> + if (!hash)
> + hash = 1;
> +
> + init_hash = hash;
> + hb = &pv_lock_hash[hash_align(hash)];
> + for (;;) {
> + for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
> + struct qspinlock *l = READ_ONCE(hb->lock);
> +
> + if (l == lock) {
> + smp_rmb(); /* matches wmb from pv_hash() */
per the above this can go, IF we observe SLOW we must also observe a
consistent bucket.
> + node = READ_ONCE(hb->node);
> + goto done;
> + }
> + }
> +
> + hash = lfsr(hash, pv_lock_hash_bits, 0);
idem the previous lfsr comment.
> + hb = &pv_lock_hash[hash_align(hash)];
> + BUG_ON(hash == init_hash);
> + }
> +done:
> + /*
> + * Clear the hash bucket
> + */
> + WRITE_ONCE(hb->lock, NULL);
> + return node;
> +}
> +/*
> + * Wait for l->locked to become clear; halt the vcpu after a short spin.
> + * __pv_queue_spin_unlock() will wake us.
> + */
> +static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> +{
> + struct __qspinlock *l = (void *)lock;
> + struct qspinlock **lp = NULL;
> + struct pv_node *pn = (struct pv_node *)node;
> + int slow_set = false;
> + int loop;
> +
> + for (;;) {
> + for (loop = SPIN_THRESHOLD; loop; loop--) {
> + if (!READ_ONCE(l->locked))
> + return;
> +
> + cpu_relax();
> + }
> +
> + WRITE_ONCE(pn->state, vcpu_halted);
> + if (!lp)
> + lp = pv_hash(lock, pn);
> + /*
> + * lp must be set before setting _Q_SLOW_VAL
> + *
> + * [S] lp = lock [RmW] l = l->locked = 0
> + * MB MB
> + * [S] l->locked = _Q_SLOW_VAL [L] lp
> + *
> + * Matches the cmpxchg() in pv_queue_spin_unlock().
> + */
> + if (!slow_set &&
> + !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
> + /*
> + * The lock is free and _Q_SLOW_VAL has never been
> + * set. Need to clear the hash bucket before getting
> + * the lock.
> + */
> + WRITE_ONCE(*lp, NULL);
> + return;
> + } else if (slow_set && !READ_ONCE(l->locked))
> + return;
> + slow_set = true;
I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
I had, namely:
if (!lp) {
lp = pv_hash(lock, pn);
/*
* comment
*/
lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
if (lv != _Q_LOCKED_VAL) {
/* we're woken, unhash and return */
WRITE_ONCE(*lp, NULL);
return;
}
}
> +
> + pv_wait(&l->locked, _Q_SLOW_VAL);
If we get a spurious wakeup (due to device interrupts or random kick)
we'll loop around but ->locked will remain _Q_SLOW_VAL.
> + }
> + /*
> + * Lock is unlocked now; the caller will acquire it without waiting.
> + * As with pv_wait_node() we rely on the caller to do a load-acquire
> + * for us.
> + */
> +}
> +
> +/*
> + * To be used in stead of queue_spin_unlock() for paravirt locks. Wakes
> + * pv_wait_head() if appropriate.
> + */
> +__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
> +{
> + struct __qspinlock *l = (void *)lock;
> + struct pv_node *node;
> +
> + if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
> + return;
> +
> + /*
> + * The queue head has been halted. Need to locate it and wake it up.
> + */
> + node = pv_hash_find(lock);
> + smp_store_release(&l->locked, 0);
Ah yes, clever that.
> + /*
> + * At this point the memory pointed at by lock can be freed/reused,
> + * however we can still use the PV node to kick the CPU.
> + */
> + if (READ_ONCE(node->state) == vcpu_halted)
> + pv_kick(node->cpu);
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
code.
On Thu, Apr 09, 2015 at 08:13:27PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
> > +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
> > +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> > +{
> > + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
> > + struct pv_hash_bucket *hb, *end;
> > +
> > + if (!hash)
> > + hash = 1;
> > +
> > + init_hash = hash;
> > + hb = &pv_lock_hash[hash_align(hash)];
> > + for (;;) {
> > + for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
> > + if (!cmpxchg(&hb->lock, NULL, lock)) {
> > + WRITE_ONCE(hb->node, node);
> > + /*
> > + * We haven't set the _Q_SLOW_VAL yet. So
> > + * the order of writing doesn't matter.
> > + */
> > + smp_wmb(); /* matches rmb from pv_hash_find */
> > + goto done;
> > + }
> > + }
> > +
> > + hash = lfsr(hash, pv_lock_hash_bits, 0);
>
> Since pv_lock_hash_bits is a variable, you end up running through that
> massive if() forest to find the corresponding tap every single time. It
> cannot compile-time optimize it.
>
> Hence:
> hash = lfsr(hash, pv_taps);
>
> (I don't get the bits argument to the lfsr).
>
> In any case, like I said before, I think we should try a linear probe
> sequence first, the lfsr was over engineering from my side.
>
> > + hb = &pv_lock_hash[hash_align(hash)];
So one thing this does -- and one of the reasons I figured I should
ditch the LFSR instead of fixing it -- is that you end up scanning each
bucket HB_PER_LINE times.
The 'fix' would be to LFSR on cachelines instead of HBs but then you're
stuck with the 0-th cacheline.
> > + BUG_ON(hash == init_hash);
> > + }
> > +
> > +done:
> > + return &hb->lock;
> > +}
On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:
> @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
> }
>
> /*
> + * Called after setting next->locked = 1 & lock acquired.
> + * Check if the the CPU has been halted. If so, set the _Q_SLOW_VAL flag
> + * and put an entry into the lock hash table to be waken up at unlock time.
> */
> -static void pv_kick_node(struct mcs_spinlock *node)
> +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
I'm not too sure about that name change..
> {
> struct pv_node *pn = (struct pv_node *)node;
> + struct __qspinlock *l = (void *)lock;
>
> /*
> + * Transition CPU state: halted => hashed
> + * Quit if the transition failed.
> */
> + if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> + return;
> +
> + /*
> + * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
> + * As this is the same CPU that will check the _Q_SLOW_VAL value and
> + * the hash table later on at unlock time, no atomic instruction is
> + * needed.
> + */
> + WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> + (void)pv_hash(lock, pn);
> }
This is broken. The unlock path relies on:
pv_hash()
MB
l->locked = SLOW
such that when it observes SLOW, it must then also observe a consistent
bucket.
The above can have us do pv_hash_find() _before_ we actually hash the
lock, which will result in us triggering that BUG_ON() in there.
On Thu, Apr 09, 2015 at 09:57:21PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:
>
> > @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
> > }
> >
> > /*
> > + * Called after setting next->locked = 1 & lock acquired.
> > + * Check if the the CPU has been halted. If so, set the _Q_SLOW_VAL flag
> > + * and put an entry into the lock hash table to be waken up at unlock time.
> > */
> > -static void pv_kick_node(struct mcs_spinlock *node)
> > +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
>
> I'm not too sure about that name change..
>
> > {
> > struct pv_node *pn = (struct pv_node *)node;
> > + struct __qspinlock *l = (void *)lock;
> >
> > /*
> > + * Transition CPU state: halted => hashed
> > + * Quit if the transition failed.
> > */
> > + if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > + return;
> > +
> > + /*
> > + * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
> > + * As this is the same CPU that will check the _Q_SLOW_VAL value and
> > + * the hash table later on at unlock time, no atomic instruction is
> > + * needed.
> > + */
> > + WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> > + (void)pv_hash(lock, pn);
> > }
>
> This is broken. The unlock path relies on:
>
> pv_hash()
> MB
> l->locked = SLOW
>
> such that when it observes SLOW, it must then also observe a consistent
> bucket.
>
> The above can have us do pv_hash_find() _before_ we actually hash the
> lock, which will result in us triggering that BUG_ON() in there.
Urgh, clearly its late and I cannot read. The comment explains it.
On 04/09/2015 02:23 PM, Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 08:13:27PM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
>>> +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
>>> +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
>>> +{
>>> + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
>>> + struct pv_hash_bucket *hb, *end;
>>> +
>>> + if (!hash)
>>> + hash = 1;
>>> +
>>> + init_hash = hash;
>>> + hb =&pv_lock_hash[hash_align(hash)];
>>> + for (;;) {
>>> + for (end = hb + PV_HB_PER_LINE; hb< end; hb++) {
>>> + if (!cmpxchg(&hb->lock, NULL, lock)) {
>>> + WRITE_ONCE(hb->node, node);
>>> + /*
>>> + * We haven't set the _Q_SLOW_VAL yet. So
>>> + * the order of writing doesn't matter.
>>> + */
>>> + smp_wmb(); /* matches rmb from pv_hash_find */
>>> + goto done;
>>> + }
>>> + }
>>> +
>>> + hash = lfsr(hash, pv_lock_hash_bits, 0);
>> Since pv_lock_hash_bits is a variable, you end up running through that
>> massive if() forest to find the corresponding tap every single time. It
>> cannot compile-time optimize it.
>>
>> Hence:
>> hash = lfsr(hash, pv_taps);
>>
>> (I don't get the bits argument to the lfsr).
>>
>> In any case, like I said before, I think we should try a linear probe
>> sequence first, the lfsr was over engineering from my side.
>>
>>> + hb =&pv_lock_hash[hash_align(hash)];
>>>
> So one thing this does -- and one of the reasons I figured I should
> ditch the LFSR instead of fixing it -- is that you end up scanning each
> bucket HB_PER_LINE times.
I am aware of that when I was trying to add the hash table debug code,
but I want to get the code out for review and so hasn't made any change
yet. I have just done testing by adding some debug code to check the
hashing efficiency. With the kernel build workload, with over 1M calls
to pv_hash(), all of them get an empty entry on the first try. Maybe the
minimum hash table size of 256 helps.
>
> The 'fix' would be to LFSR on cachelines instead of HBs but then you're
> stuck with the 0-th cacheline.
This should not be a big problem. I just need to add a check at the end
of the for loop that if hash is 0, change it to a certain non-0 value
instead of calling lfsr().
As for ditching the lfsr idea, I am fine with that. So there will be 4
entries (1 cacheline) for each hash value. If all the entries are full,
we proceed to the next cacheline. Right?
Cheers,
Longman
On 04/09/2015 02:13 PM, Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -0,0 +1,321 @@
>> +#ifndef _GEN_PV_LOCK_SLOWPATH
>> +#error "do not include this file"
>> +#endif
>> +
>> +/*
>> + * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
>> + * of spinning them.
>> + *
>> + * This relies on the architecture to provide two paravirt hypercalls:
>> + *
>> + * pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
>> + * pv_kick(cpu) -- wakes a suspended vcpu
>> + *
>> + * Using these we implement __pv_queue_spin_lock_slowpath() and
>> + * __pv_queue_spin_unlock() to replace native_queue_spin_lock_slowpath() and
>> + * native_queue_spin_unlock().
>> + */
>> +
>> +#define _Q_SLOW_VAL (3U<< _Q_LOCKED_OFFSET)
>> +
>> +enum vcpu_state {
>> + vcpu_running = 0,
>> + vcpu_halted,
>> +};
>> +
>> +struct pv_node {
>> + struct mcs_spinlock mcs;
>> + struct mcs_spinlock __res[3];
>> +
>> + int cpu;
>> + u8 state;
>> +};
>> +
>> +/*
>> + * Hash table using open addressing with an LFSR probe sequence.
>> + *
>> + * Since we should not be holding locks from NMI context (very rare indeed) the
>> + * max load factor is 0.75, which is around the point where open addressing
>> + * breaks down.
>> + *
>> + * Instead of probing just the immediate bucket we probe all buckets in the
>> + * same cacheline.
>> + *
>> + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
>> + *
>> + * Dynamically allocate a hash table big enough to hold at least 4X the
>> + * number of possible cpus in the system. Allocation is done on page
>> + * granularity. So the minimum number of hash buckets should be at least
>> + * 256 to fully utilize a 4k page.
>> + */
>> +#define LFSR_MIN_BITS 8
>> +#define LFSR_MAX_BITS (2 + NR_CPUS_BITS)
>> +#if LFSR_MAX_BITS< LFSR_MIN_BITS
>> +#undef LFSR_MAX_BITS
>> +#define LFSR_MAX_BITS LFSR_MIN_BITS
>> +#endif
>> +
>> +struct pv_hash_bucket {
>> + struct qspinlock *lock;
>> + struct pv_node *node;
>> +};
>> +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
>> +#define HB_RESERVED ((struct qspinlock *)1)
> This is unused.
You are right, I will remove that.
>> +
>> +static struct pv_hash_bucket *pv_lock_hash;
>> +static unsigned int pv_lock_hash_bits __read_mostly;
> static unsigned int pv_taps __read_mostly;
It will depend on whether we keep the lfsr code or not.
>> +
>> +#include<linux/hash.h>
>> +#include<linux/lfsr.h>
>> +#include<linux/bootmem.h>
>> +
>> +/*
>> + * Allocate memory for the PV qspinlock hash buckets
>> + *
>> + * This function should be called from the paravirt spinlock initialization
>> + * routine.
>> + */
>> +void __init __pv_init_lock_hash(void)
>> +{
>> + int pv_hash_size = 4 * num_possible_cpus();
>> +
>> + if (pv_hash_size< (1U<< LFSR_MIN_BITS))
>> + pv_hash_size = (1U<< LFSR_MIN_BITS);
>> + /*
>> + * Allocate space from bootmem which should be page-size aligned
>> + * and hence cacheline aligned.
>> + */
>> + pv_lock_hash = alloc_large_system_hash("PV qspinlock",
>> + sizeof(struct pv_hash_bucket),
>> + pv_hash_size, 0, HASH_EARLY,
>> + &pv_lock_hash_bits, NULL,
>> + pv_hash_size, pv_hash_size);
> pv_taps = lfsr_taps(pv_lock_hash_bits);
>
I don't understand what you meant here.
>> +}
>> +
>> +static inline u32 hash_align(u32 hash)
>> +{
>> + return hash& ~(PV_HB_PER_LINE - 1);
>> +}
>> +
>> +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
>> +{
>> + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
>> + struct pv_hash_bucket *hb, *end;
>> +
>> + if (!hash)
>> + hash = 1;
>> +
>> + init_hash = hash;
>> + hb =&pv_lock_hash[hash_align(hash)];
>> + for (;;) {
>> + for (end = hb + PV_HB_PER_LINE; hb< end; hb++) {
>> + if (!cmpxchg(&hb->lock, NULL, lock)) {
>> + WRITE_ONCE(hb->node, node);
>> + /*
>> + * We haven't set the _Q_SLOW_VAL yet. So
>> + * the order of writing doesn't matter.
>> + */
>> + smp_wmb(); /* matches rmb from pv_hash_find */
> This doesn't make sense. Both sites do ->lock first and ->node second.
> No amount of ordering can 'fix' that.
>
> I think we can safely remove this wmb and the rmb below, because the
> required ordering is already provided by setting/observing l->locked ==
> SLOW.
>
Right, I was over-cautious here. Will remove that.
>> + goto done;
>> + }
>> + }
>> +
>> + hash = lfsr(hash, pv_lock_hash_bits, 0);
> Since pv_lock_hash_bits is a variable, you end up running through that
> massive if() forest to find the corresponding tap every single time. It
> cannot compile-time optimize it.
The minimum bits size is now 8. So unless the system has more than 64
vCPUs, it will get the right value in the first if statement.
> Hence:
> hash = lfsr(hash, pv_taps);
>
> (I don't get the bits argument to the lfsr).
>
> In any case, like I said before, I think we should try a linear probe
> sequence first, the lfsr was over engineering from my side.
As I said before, I am fine with removing the lfsr() code.
>> + hb =&pv_lock_hash[hash_align(hash)];
>> + BUG_ON(hash == init_hash);
>> + }
>> +
>> +done:
>> + return&hb->lock;
>> +}
>> +
>> +static struct pv_node *pv_hash_find(struct qspinlock *lock)
>> +{
>> + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
>> + struct pv_hash_bucket *hb, *end;
>> + struct pv_node *node = NULL;
>> +
>> + if (!hash)
>> + hash = 1;
>> +
>> + init_hash = hash;
>> + hb =&pv_lock_hash[hash_align(hash)];
>> + for (;;) {
>> + for (end = hb + PV_HB_PER_LINE; hb< end; hb++) {
>> + struct qspinlock *l = READ_ONCE(hb->lock);
>> +
>> + if (l == lock) {
>> + smp_rmb(); /* matches wmb from pv_hash() */
> per the above this can go, IF we observe SLOW we must also observe a
> consistent bucket.
Will remove that.
>> + node = READ_ONCE(hb->node);
>> + goto done;
>> + }
>> + }
>> +
>> + hash = lfsr(hash, pv_lock_hash_bits, 0);
> idem the previous lfsr comment.
>
>> + hb =&pv_lock_hash[hash_align(hash)];
>> + BUG_ON(hash == init_hash);
>> + }
>> +done:
>> + /*
>> + * Clear the hash bucket
>> + */
>> + WRITE_ONCE(hb->lock, NULL);
>> + return node;
>> +}
>> +/*
>> + * Wait for l->locked to become clear; halt the vcpu after a short spin.
>> + * __pv_queue_spin_unlock() will wake us.
>> + */
>> +static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> +{
>> + struct __qspinlock *l = (void *)lock;
>> + struct qspinlock **lp = NULL;
>> + struct pv_node *pn = (struct pv_node *)node;
>> + int slow_set = false;
>> + int loop;
>> +
>> + for (;;) {
>> + for (loop = SPIN_THRESHOLD; loop; loop--) {
>> + if (!READ_ONCE(l->locked))
>> + return;
>> +
>> + cpu_relax();
>> + }
>> +
>> + WRITE_ONCE(pn->state, vcpu_halted);
>> + if (!lp)
>> + lp = pv_hash(lock, pn);
>> + /*
>> + * lp must be set before setting _Q_SLOW_VAL
>> + *
>> + * [S] lp = lock [RmW] l = l->locked = 0
>> + * MB MB
>> + * [S] l->locked = _Q_SLOW_VAL [L] lp
>> + *
>> + * Matches the cmpxchg() in pv_queue_spin_unlock().
>> + */
>> + if (!slow_set&&
>> + !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
>> + /*
>> + * The lock is free and _Q_SLOW_VAL has never been
>> + * set. Need to clear the hash bucket before getting
>> + * the lock.
>> + */
>> + WRITE_ONCE(*lp, NULL);
>> + return;
>> + } else if (slow_set&& !READ_ONCE(l->locked))
>> + return;
>> + slow_set = true;
> I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
> I had, namely:
>
> if (!lp) {
> lp = pv_hash(lock, pn);
>
> /*
> * comment
> */
> lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
> if (lv != _Q_LOCKED_VAL) {
> /* we're woken, unhash and return */
> WRITE_ONCE(*lp, NULL);
> return;
> }
> }
>> +
>> + pv_wait(&l->locked, _Q_SLOW_VAL);
>
> If we get a spurious wakeup (due to device interrupts or random kick)
> we'll loop around but ->locked will remain _Q_SLOW_VAL.
The purpose of the slow_set flag is not about the lock value. It is to
make sure that pv_hash_find() will always find a match. Consider the
following scenario:
cpu1 cpu2 cpu3
---- ---- ----
pv_wait
spurious wakeup
loop l->locked
read _Q_SLOW_VAL
pv_hash_find()
unlock
pv_hash() <- same entry
cmpxchg(&l->locked)
clear hash (?)
Here, the entry for cpu3 will be removed leading to panic when
pv_hash_find() can find the entry. So the hash entry can only be
removed if the other cpu has no chance to see _Q_SLOW_VAL.
>> + }
>> + /*
>> + * Lock is unlocked now; the caller will acquire it without waiting.
>> + * As with pv_wait_node() we rely on the caller to do a load-acquire
>> + * for us.
>> + */
>> +}
>> +
>> +/*
>> + * To be used in stead of queue_spin_unlock() for paravirt locks. Wakes
>> + * pv_wait_head() if appropriate.
>> + */
>> +__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
>> +{
>> + struct __qspinlock *l = (void *)lock;
>> + struct pv_node *node;
>> +
>> + if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
>> + return;
>> +
>> + /*
>> + * The queue head has been halted. Need to locate it and wake it up.
>> + */
>> + node = pv_hash_find(lock);
>> + smp_store_release(&l->locked, 0);
> Ah yes, clever that.
>
>> + /*
>> + * At this point the memory pointed at by lock can be freed/reused,
>> + * however we can still use the PV node to kick the CPU.
>> + */
>> + if (READ_ONCE(node->state) == vcpu_halted)
>> + pv_kick(node->cpu);
>> +}
>> +PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
> However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
> code.
That is why I originally put my version of the qspinlock_paravirt.h
header file under arch/x86/include/asm. Maybe we should move it back
there. Putting the thunk in arch/x86/kernel/kvm.c didn't work when you
consider that the Xen code also need that.
Cheers,
Longman
On 04/09/2015 03:57 PM, Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:
>
>> @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
>> }
>>
>> /*
>> + * Called after setting next->locked = 1& lock acquired.
>> + * Check if the the CPU has been halted. If so, set the _Q_SLOW_VAL flag
>> + * and put an entry into the lock hash table to be waken up at unlock time.
>> */
>> -static void pv_kick_node(struct mcs_spinlock *node)
>> +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
> I'm not too sure about that name change..
It is because the function will no longer kick the cpu. So I change the
name to reflect that. I am not good at naming. Please let me know if you
have a better name.
Cheers,
Longman
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
> >>+void __init __pv_init_lock_hash(void)
> >>+{
> >>+ int pv_hash_size = 4 * num_possible_cpus();
> >>+
> >>+ if (pv_hash_size< (1U<< LFSR_MIN_BITS))
> >>+ pv_hash_size = (1U<< LFSR_MIN_BITS);
> >>+ /*
> >>+ * Allocate space from bootmem which should be page-size aligned
> >>+ * and hence cacheline aligned.
> >>+ */
> >>+ pv_lock_hash = alloc_large_system_hash("PV qspinlock",
> >>+ sizeof(struct pv_hash_bucket),
> >>+ pv_hash_size, 0, HASH_EARLY,
> >>+ &pv_lock_hash_bits, NULL,
> >>+ pv_hash_size, pv_hash_size);
> > pv_taps = lfsr_taps(pv_lock_hash_bits);
> >
>
> I don't understand what you meant here.
Let me explain (even though I propose taking all the LFSR stuff out).
pv_lock_hash_bit is a runtime variable, therefore it cannot compile time
evaluate the forest of if statements required to compute the taps value.
Therefore its best to compute the taps _once_, and this seems like a
good place to do so.
> >>+ goto done;
> >>+ }
> >>+ }
> >>+
> >>+ hash = lfsr(hash, pv_lock_hash_bits, 0);
> >Since pv_lock_hash_bits is a variable, you end up running through that
> >massive if() forest to find the corresponding tap every single time. It
> >cannot compile-time optimize it.
>
> The minimum bits size is now 8. So unless the system has more than 64 vCPUs,
> it will get the right value in the first if statement.
Still, no reason to not pre-compute the taps value, its simple enough.
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
> >>+static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> >>+{
> >>+ struct __qspinlock *l = (void *)lock;
> >>+ struct qspinlock **lp = NULL;
> >>+ struct pv_node *pn = (struct pv_node *)node;
> >>+ int slow_set = false;
> >>+ int loop;
> >>+
> >>+ for (;;) {
> >>+ for (loop = SPIN_THRESHOLD; loop; loop--) {
> >>+ if (!READ_ONCE(l->locked))
> >>+ return;
> >>+
> >>+ cpu_relax();
> >>+ }
> >>+
> >>+ WRITE_ONCE(pn->state, vcpu_halted);
> >>+ if (!lp)
> >>+ lp = pv_hash(lock, pn);
> >>+ /*
> >>+ * lp must be set before setting _Q_SLOW_VAL
> >>+ *
> >>+ * [S] lp = lock [RmW] l = l->locked = 0
> >>+ * MB MB
> >>+ * [S] l->locked = _Q_SLOW_VAL [L] lp
> >>+ *
> >>+ * Matches the cmpxchg() in pv_queue_spin_unlock().
> >>+ */
> >>+ if (!slow_set&&
> >>+ !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
> >>+ /*
> >>+ * The lock is free and _Q_SLOW_VAL has never been
> >>+ * set. Need to clear the hash bucket before getting
> >>+ * the lock.
> >>+ */
> >>+ WRITE_ONCE(*lp, NULL);
> >>+ return;
> >>+ } else if (slow_set&& !READ_ONCE(l->locked))
> >>+ return;
> >>+ slow_set = true;
> >I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
> >I had, namely:
> >
> > if (!lp) {
> > lp = pv_hash(lock, pn);
> >
> > /*
> > * comment
> > */
> > lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
> > if (lv != _Q_LOCKED_VAL) {
> > /* we're woken, unhash and return */
> > WRITE_ONCE(*lp, NULL);
> > return;
> > }
> > }
> >>+
> >>+ pv_wait(&l->locked, _Q_SLOW_VAL);
> >
> >If we get a spurious wakeup (due to device interrupts or random kick)
> >we'll loop around but ->locked will remain _Q_SLOW_VAL.
>
> The purpose of the slow_set flag is not about the lock value. It is to make
> sure that pv_hash_find() will always find a match. Consider the following
> scenario:
>
> cpu1 cpu2 cpu3
> ---- ---- ----
> pv_wait
> spurious wakeup
> loop l->locked
>
> read _Q_SLOW_VAL
> pv_hash_find()
> unlock
>
> pv_hash() <- same entry
>
> cmpxchg(&l->locked)
> clear hash (?)
>
> Here, the entry for cpu3 will be removed leading to panic when
> pv_hash_find() can find the entry. So the hash entry can only be
> removed if the other cpu has no chance to see _Q_SLOW_VAL.
Still confused. Afaict that cannot happen with my code. We only do the
cmpxchg(, _Q_SLOW_VAL) _once_.
Only on the first time around that loop will we hash the lock and set
the slow flag. And cpu3 cannot come in on the same entry because we've
not yet released the lock when we find and unhash.
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
> >>+__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
> >>+{
> >>+ struct __qspinlock *l = (void *)lock;
> >>+ struct pv_node *node;
> >>+
> >>+ if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
> >>+ return;
> >>+
> >>+ /*
> >>+ * The queue head has been halted. Need to locate it and wake it up.
> >>+ */
> >>+ node = pv_hash_find(lock);
> >>+ smp_store_release(&l->locked, 0);
> >Ah yes, clever that.
> >
> >>+ /*
> >>+ * At this point the memory pointed at by lock can be freed/reused,
> >>+ * however we can still use the PV node to kick the CPU.
> >>+ */
> >>+ if (READ_ONCE(node->state) == vcpu_halted)
> >>+ pv_kick(node->cpu);
> >>+}
> >>+PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
> >However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
> >code.
>
> That is why I originally put my version of the qspinlock_paravirt.h header
> file under arch/x86/include/asm. Maybe we should move it back there. Putting
> the thunk in arch/x86/kernel/kvm.c didn't work when you consider that the
> Xen code also need that.
Well the function is 'generic' and belong here I think. Its just the
PV_CALLEE_SAVE_REGS_THUNK thing that arch specific. Should have live in
arch/x86/kernel/paravirt-spinlocks.c instead?
On 04/13/2015 10:47 AM, Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
>>>> +void __init __pv_init_lock_hash(void)
>>>> +{
>>>> + int pv_hash_size = 4 * num_possible_cpus();
>>>> +
>>>> + if (pv_hash_size< (1U<< LFSR_MIN_BITS))
>>>> + pv_hash_size = (1U<< LFSR_MIN_BITS);
>>>> + /*
>>>> + * Allocate space from bootmem which should be page-size aligned
>>>> + * and hence cacheline aligned.
>>>> + */
>>>> + pv_lock_hash = alloc_large_system_hash("PV qspinlock",
>>>> + sizeof(struct pv_hash_bucket),
>>>> + pv_hash_size, 0, HASH_EARLY,
>>>> + &pv_lock_hash_bits, NULL,
>>>> + pv_hash_size, pv_hash_size);
>>> pv_taps = lfsr_taps(pv_lock_hash_bits);
>>>
>> I don't understand what you meant here.
> Let me explain (even though I propose taking all the LFSR stuff out).
>
> pv_lock_hash_bit is a runtime variable, therefore it cannot compile time
> evaluate the forest of if statements required to compute the taps value.
>
> Therefore its best to compute the taps _once_, and this seems like a
> good place to do so.
OK, I got it. That make sense.
>>>> + goto done;
>>>> + }
>>>> + }
>>>> +
>>>> + hash = lfsr(hash, pv_lock_hash_bits, 0);
>>> Since pv_lock_hash_bits is a variable, you end up running through that
>>> massive if() forest to find the corresponding tap every single time. It
>>> cannot compile-time optimize it.
>> The minimum bits size is now 8. So unless the system has more than 64 vCPUs,
>> it will get the right value in the first if statement.
> Still, no reason to not pre-compute the taps value, its simple enough.
>
Still, we need to keep the hash_bits value as it will needed by the
hashing function.
I have taken out the lfsr code and use linear probing in the updated
qspinlock patch that I am working on. However, we can always add that
back in as an additional patch.
Cheers,
Longman
On 04/13/2015 11:08 AM, Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
>
>>>> +static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>>> +{
>>>> + struct __qspinlock *l = (void *)lock;
>>>> + struct qspinlock **lp = NULL;
>>>> + struct pv_node *pn = (struct pv_node *)node;
>>>> + int slow_set = false;
>>>> + int loop;
>>>> +
>>>> + for (;;) {
>>>> + for (loop = SPIN_THRESHOLD; loop; loop--) {
>>>> + if (!READ_ONCE(l->locked))
>>>> + return;
>>>> +
>>>> + cpu_relax();
>>>> + }
>>>> +
>>>> + WRITE_ONCE(pn->state, vcpu_halted);
>>>> + if (!lp)
>>>> + lp = pv_hash(lock, pn);
>>>> + /*
>>>> + * lp must be set before setting _Q_SLOW_VAL
>>>> + *
>>>> + * [S] lp = lock [RmW] l = l->locked = 0
>>>> + * MB MB
>>>> + * [S] l->locked = _Q_SLOW_VAL [L] lp
>>>> + *
>>>> + * Matches the cmpxchg() in pv_queue_spin_unlock().
>>>> + */
>>>> + if (!slow_set&&
>>>> + !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
>>>> + /*
>>>> + * The lock is free and _Q_SLOW_VAL has never been
>>>> + * set. Need to clear the hash bucket before getting
>>>> + * the lock.
>>>> + */
>>>> + WRITE_ONCE(*lp, NULL);
>>>> + return;
>>>> + } else if (slow_set&& !READ_ONCE(l->locked))
>>>> + return;
>>>> + slow_set = true;
>>> I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
>>> I had, namely:
>>>
>>> if (!lp) {
>>> lp = pv_hash(lock, pn);
>>>
>>> /*
>>> * comment
>>> */
>>> lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
>>> if (lv != _Q_LOCKED_VAL) {
>>> /* we're woken, unhash and return */
>>> WRITE_ONCE(*lp, NULL);
>>> return;
>>> }
>>> }
>>>> +
>>>> + pv_wait(&l->locked, _Q_SLOW_VAL);
>>> If we get a spurious wakeup (due to device interrupts or random kick)
>>> we'll loop around but ->locked will remain _Q_SLOW_VAL.
>> The purpose of the slow_set flag is not about the lock value. It is to make
>> sure that pv_hash_find() will always find a match. Consider the following
>> scenario:
>>
>> cpu1 cpu2 cpu3
>> ---- ---- ----
>> pv_wait
>> spurious wakeup
>> loop l->locked
>>
>> read _Q_SLOW_VAL
>> pv_hash_find()
>> unlock
>>
>> pv_hash()<- same entry
>>
>> cmpxchg(&l->locked)
>> clear hash (?)
>>
>> Here, the entry for cpu3 will be removed leading to panic when
>> pv_hash_find() can find the entry. So the hash entry can only be
>> removed if the other cpu has no chance to see _Q_SLOW_VAL.
> Still confused. Afaict that cannot happen with my code. We only do the
> cmpxchg(, _Q_SLOW_VAL) _once_.
>
> Only on the first time around that loop will we hash the lock and set
> the slow flag. And cpu3 cannot come in on the same entry because we've
> not yet released the lock when we find and unhash.
>
>
Maybe I am not clear in my illustration. More than one locks can be
hashed to the same value. So cpu3 is accessing a different lock which
has the same hashed value as the lock used by cpu1 and cpu2.
Anyway, I remove the slow_set flag by unrolling the retry loop so that
after pv_wait(), it goes into the 2nd loop instead of going back to the
top. As a result, cmpxchg and pv_hash can only be called once.
Cheers,
Longman
On 04/13/2015 11:09 AM, Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
>>>> +__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
>>>> +{
>>>> + struct __qspinlock *l = (void *)lock;
>>>> + struct pv_node *node;
>>>> +
>>>> + if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
>>>> + return;
>>>> +
>>>> + /*
>>>> + * The queue head has been halted. Need to locate it and wake it up.
>>>> + */
>>>> + node = pv_hash_find(lock);
>>>> + smp_store_release(&l->locked, 0);
>>> Ah yes, clever that.
>>>
>>>> + /*
>>>> + * At this point the memory pointed at by lock can be freed/reused,
>>>> + * however we can still use the PV node to kick the CPU.
>>>> + */
>>>> + if (READ_ONCE(node->state) == vcpu_halted)
>>>> + pv_kick(node->cpu);
>>>> +}
>>>> +PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
>>> However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
>>> code.
>> That is why I originally put my version of the qspinlock_paravirt.h header
>> file under arch/x86/include/asm. Maybe we should move it back there. Putting
>> the thunk in arch/x86/kernel/kvm.c didn't work when you consider that the
>> Xen code also need that.
> Well the function is 'generic' and belong here I think. Its just the
> PV_CALLEE_SAVE_REGS_THUNK thing that arch specific. Should have live in
> arch/x86/kernel/paravirt-spinlocks.c instead?
Another alternative is to put the PV_CALLEE_SAVE_REGS_THUNK into a
separate asm/qspinlock_paravirt.h file and included in the generic
qspinlock_paravirt.h file. By putting the thunk with the
__pv_queue_spin_unlock together in the same file, we can make the thunk
and the unlock fast path (_Q_SLOW_VAL not set) go into two adjacent
instruction cachelines. I think that may help a bit in term of performance.
Cheers,
Longman