v2->v3:
- Moved the deferred kicking enablement patch forward and moved back
the kick-ahead patch to make the effect of kick-ahead more visible.
- Reworked patch 6 to make it more readable.
- Reverted to using state as a tri-state variable instead of adding
an additional bi-state variable.
- Added performance data for different values of PV_KICK_AHEAD_MAX.
- Added a new patch to optimize PV unlock code path performance.
v1->v2:
- Take out the queued unfair lock patches
- Add a patch to simplify the PV unlock code
- Move pending bit and statistics collection patches to the front
- Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
when appropriate.
- Change the wait-early patch to use adaptive spinning to better
balance its different effects on normal and over-committed guests.
- Add patch-to-patch performance changes in the patch commit logs.
This patchset tries to improve the performance of both normal and
over-committed VM guests. The kick-ahead and adaptive spinning
patches are inspired by the "Do Virtual Machines Really Scale?" blog
from Sanidhya Kashyap.
Patch 1 simplifies the unlock code by doing an unconditional vCPU kick
when _Q_SLOW_VAL is set, as the chance of a spurious wakeup showing
up in the statistical data that I collected was very low (1 or 2
occurrences occasionally).
Patch 2 adds pending bit support to pvqspinlock, improving performance
at light load.
Patch 3 allows the collection of various count data that are useful
for seeing what is happening in the system. Collecting them does add a
bit of overhead when enabled, slowing performance slightly.
Patch 4 is an enablement patch for deferring vCPU kicks from the
lock side to the unlock side.
Patch 5 enables multiple vCPU kick-aheads at unlock time, outside of
the critical section, which can improve performance in overcommitted
guests and sometimes even in normal guests.
Patch 6 enables adaptive spinning in the queue nodes. This patch can
lead to a pretty big performance increase in over-committed guests at
the expense of a slight performance hit in normal guests.
Patch 7 optimizes the PV unlock code path performance.
Performance measurements were done on 32-CPU Westmere-EX and
Haswell-EX systems. The Westmere-EX system got the most performance
gain from patch 5, whereas the Haswell-EX system got the most gain
from patch 6 for over-committed guests.
The table below shows the Linux kernel build times for various
values of PV_KICK_AHEAD_MAX on an over-committed 48-vCPU guest on
the Westmere-EX system:
PV_KICK_AHEAD_MAX Patches 1-5 Patches 1-6
----------------- ----------- -----------
1 9m46.9s 11m10.1s
2 9m40.2s 10m08.3s
3 9m36.8s 9m49.8s
4 9m35.9s 9m38.7s
5 9m35.1s 9m33.0s
6 9m35.7s 9m28.5s
With patches 1-5, the performance wasn't very sensitive to different
PV_KICK_AHEAD_MAX values. Adding patch 6 into the mix, however, changed
the picture quite dramatically. There was a performance regression when
PV_KICK_AHEAD_MAX was too small, and starting with a value of 4,
increasing PV_KICK_AHEAD_MAX brought only a minor benefit.
Waiman Long (7):
locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
locking/pvqspinlock: Add pending bit support
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call
locking/pvqspinlock: Allow vCPUs kick-ahead
locking/pvqspinlock: Queue node adaptive spinning
locking/pvqspinlock, x86: Optimize PV unlock code path
arch/x86/Kconfig | 7 +
arch/x86/include/asm/qspinlock_paravirt.h | 60 ++++
kernel/locking/qspinlock.c | 38 ++-
kernel/locking/qspinlock_paravirt.h | 537 +++++++++++++++++++++++++++--
4 files changed, 604 insertions(+), 38 deletions(-)
The smp_store_release() is not a full barrier. In order to avoid a
missed wakeup, we may need to add memory barriers around the locked and
cpu state variables, adding to complexity. As the chance of a spurious
wakeup is very low, it is easier and safer to just do an unconditional
kick at unlock time.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 15d3733..37bc363 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -240,7 +240,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}
- WRITE_ONCE(pn->state, vcpu_halted);
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
@@ -321,8 +320,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
- pv_kick(node->cpu);
+ pv_kick(node->cpu);
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1
Like the native qspinlock, using the pending bit to acquire the lock
when it is lightly loaded is faster than going through the PV queuing
process, which is even slower than the native queuing process. It also
avoids loading two additional cachelines (the MCS and PV nodes).
This patch adds pending bit support for PV qspinlock. The pending
bit code has a smaller spin threshold (1<<10). It will fall back
to the queuing method if it cannot acquire the lock within a certain
time limit.
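For reference, the 1<<10 figure comes from the threshold definition
added in this patch; assuming the usual x86 SPIN_THRESHOLD of (1 << 15),
it works out to:

	PENDING_SPIN_THRESHOLD = SPIN_THRESHOLD >> 5
	                       = (1 << 15) >> 5 = 1 << 10 = 1024 spin iterations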
On a VM with 32 vCPUs on a 32-core Westmere-EX box, the kernel
build times on 4.2-rc1 based kernels were:
Kernel Build Time Sys Time
------ ---------- --------
w/o patch 3m28.5s 28m17.5s
with patch 3m19.3s 23m55.7s
Using a locking microbenchmark on the same system, the locking
rates in (kops/s) were:
Threads Rate w/o patch Rate with patch
------- -------------- ---------------
2 (same socket) 6,515,265 7,077,476
2 (diff sockets) 2,967,145 4,353,851
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 27 ++++++++++++++-
kernel/locking/qspinlock_paravirt.h | 66 +++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+), 1 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..6518ee9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -162,6 +162,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->pending, 0);
+}
+
/*
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
@@ -193,6 +204,15 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
}
/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL, &lock->val);
+}
+
+/**
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
* @tail : The new queue tail code word
@@ -245,6 +265,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }
#define pv_enabled() false
+#define pv_pending_lock(l, v) false
#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
@@ -286,8 +307,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
- if (pv_enabled())
+ if (pv_enabled()) {
+ if (pv_pending_lock(lock, val))
+ return; /* Got the lock via pending bit */
goto queue;
+ }
if (virt_queued_spin_lock(lock))
return;
@@ -463,6 +487,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_wait_node
#undef pv_kick_node
#undef pv_wait_head
+#undef pv_pending_lock
#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 37bc363..d64f054 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -22,6 +22,14 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+/*
+ * Queued Spinlock Spin Threshold
+ *
+ * The vCPU will spin a relatively short time in pending mode before falling
+ * back to queuing.
+ */
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+
enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
@@ -152,6 +160,64 @@ static void pv_init_node(struct mcs_spinlock *node)
}
/*
+ * Try to acquire the lock and wait using the pending bit
+ */
+static int pv_pending_lock(struct qspinlock *lock, u32 val)
+{
+ int loop = PENDING_SPIN_THRESHOLD;
+ u32 new, old;
+
+ /*
+ * wait for in-progress pending->locked hand-overs
+ */
+ if (val == _Q_PENDING_VAL) {
+ while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
+ loop--)
+ cpu_relax();
+ }
+
+ /*
+ * trylock || pending
+ */
+ for (;;) {
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+ if (loop-- <= 0)
+ goto queue;
+ }
+
+ if (new == _Q_LOCKED_VAL)
+ goto gotlock;
+ /*
+ * We are pending, wait for the owner to go away.
+ */
+ while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
+ && (loop-- > 0))
+ cpu_relax();
+
+ if (!(val & _Q_LOCKED_MASK)) {
+ clear_pending_set_locked(lock);
+ goto gotlock;
+ }
+ /*
+ * Clear the pending bit and fall back to queuing
+ */
+ clear_pending(lock);
+
+queue:
+ return 0;
+
+gotlock:
+ return 1;
+}
+
+/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_kick_node() is used to wake the vcpu again.
*/
--
1.7.1
This patch enables the accumulation of kicking- and waiting-related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of kicking and
wakeup latencies, which have a heavy dependency on the CPUs being used.
The measured latencies for different CPUs are:
CPU Wakeup Kicking
--- ------ -------
Haswell-EX 89.8us 7.4us
Westmere-EX 67.6us 9.3us
The measured latencies varied a bit from run to run. The wakeup
latency is much higher than the kicking latency.
A sample of statistics counts after a kernel build (no CPU overcommit)
was:
hash_hops_count=43
kick_latencies=5783565492
kick_time_count=640269
lock_kick_count=640238
pending_fail_count=10672
pending_lock_count=2946871
spurious_wakeup=14
unlock_kick_count=38
wait_again_count=4
wait_head_count=41
wait_node_count=640242
wake_latencies=42491684295
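As a rough illustration of how the average latencies described in the
patch below can be derived from these counts (values rounded):

	Avg kick latency ~= 5783565492 / (640238 + 38) ns ~= 9.0 us
	Avg wake latency ~= 42491684295 / 640269 ns       ~= 66.4 us
	Avg hops/hash    ~= 43 / 38                       ~= 1.1

which is in the same ballpark as the per-CPU latencies listed above.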
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 7 ++
kernel/locking/qspinlock_paravirt.h | 178 ++++++++++++++++++++++++++++++++++-
2 files changed, 181 insertions(+), 4 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..299a1c4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -679,6 +679,13 @@ config PARAVIRT_SPINLOCKS
If you are unsure how to answer this question, answer Y.
+config QUEUED_LOCK_STAT
+ bool "Paravirt queued lock statistics"
+ depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ ---help---
+ Enable the collection of statistical data on the behavior of
+ paravirtualized queued spinlocks and report them on debugfs.
+
source "arch/x86/xen/Kconfig"
config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index d64f054..db8d08b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -44,6 +44,153 @@ struct pv_node {
};
/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+ pvstat_wait_head,
+ pvstat_wait_node,
+ pvstat_wait_again,
+ pvstat_kick_time,
+ pvstat_lock_kick,
+ pvstat_unlock_kick,
+ pvstat_pend_lock,
+ pvstat_pend_fail,
+ pvstat_spurious,
+ pvstat_hops,
+ pvstat_num /* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statiatics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+ [pvstat_wait_head] = "wait_head_count",
+ [pvstat_wait_node] = "wait_node_count",
+ [pvstat_wait_again] = "wait_again_count",
+ [pvstat_kick_time] = "kick_time_count",
+ [pvstat_lock_kick] = "lock_kick_count",
+ [pvstat_unlock_kick] = "unlock_kick_count",
+ [pvstat_pend_lock] = "pending_lock_count",
+ [pvstat_pend_fail] = "pending_fail_count",
+ [pvstat_spurious] = "spurious_wakeup",
+ [pvstat_hops] = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency = pv_kick_latencies/(lock_kick_count + unlock_kick_count)
+ * Avg wake latency = pv_wake_latencies/kick_time_count
+ * Avg # of hops/hash = hash_hops_count/unlock_kick_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+ struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+ int i;
+
+ if (!d_pvqlock)
+ pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+ for (i = 0; i < pvstat_num; i++)
+ debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)&pvstats[i]);
+ debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_kick_latencies);
+ debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_wake_latencies);
+ debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+ int i;
+
+ for (i = 0; i < pvstat_num; i++)
+ atomic_set(&pvstats[i], 0);
+ atomic64_set(&pv_kick_latencies, 0);
+ atomic64_set(&pv_wake_latencies, 0);
+ reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+ atomic_inc(&pvstats[stat]);
+ if (unlikely(reset_cnts))
+ pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+ atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+ u64 start = sched_clock();
+
+ *per_cpu_ptr(&pv_kick_time, cpu) = start;
+ pv_kick(cpu);
+ atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+ u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+ *pkick_time = 0;
+ pv_wait(ptr, val);
+ if (*pkick_time) {
+ atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+ pvstat_inc(pvstat_kick_time);
+ }
+}
+
+#define pv_kick(c) __pv_kick(c)
+#define pv_wait(p, v) __pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt) { }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
* Lock and MCS node addresses hash table for fast lookup
*
* Hashing is done on a per-cacheline basis to minimize the need to access
@@ -103,10 +250,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
+ int hopcnt = 0;
for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
+ pvstat_hop(hopcnt);
return &he->lock;
}
}
@@ -209,11 +359,13 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
* Clear the pending bit and fall back to queuing
*/
clear_pending(lock);
+ pvstat_inc(pvstat_pend_fail);
queue:
return 0;
gotlock:
+ pvstat_inc(pvstat_pend_lock);
return 1;
}
@@ -224,9 +376,10 @@ gotlock:
static void pv_wait_node(struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ int waitcnt = 0;
int loop;
- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
@@ -244,14 +397,20 @@ static void pv_wait_node(struct mcs_spinlock *node)
*/
smp_store_mb(pn->state, vcpu_halted);
- if (!READ_ONCE(node->locked))
+ if (!READ_ONCE(node->locked)) {
+ pvstat_inc(pvstat_wait_node);
+ if (waitcnt)
+ pvstat_inc(pvstat_wait_again);
pv_wait(&pn->state, vcpu_halted);
+ }
/*
* Reset the vCPU state to avoid unncessary CPU kicking
*/
WRITE_ONCE(pn->state, vcpu_running);
+ if (READ_ONCE(node->locked))
+ break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -259,6 +418,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
+ pvstat_inc(pvstat_spurious);
}
/*
* By now our node->locked should be 1 and our caller will not actually
@@ -284,8 +444,10 @@ static void pv_kick_node(struct mcs_spinlock *node)
*
* See the comment in pv_wait_node().
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
+ if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
+ pvstat_inc(pvstat_lock_kick);
pv_kick(pn->cpu);
+ }
}
/*
@@ -297,9 +459,10 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
struct pv_node *pn = (struct pv_node *)node;
struct __qspinlock *l = (void *)lock;
struct qspinlock **lp = NULL;
+ int waitcnt = 0;
int loop;
- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
return;
@@ -327,14 +490,20 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pvstat_inc(pvstat_wait_head);
+ if (waitcnt)
+ pvstat_inc(pvstat_wait_again);
pv_wait(&l->locked, _Q_SLOW_VAL);
+ if (!READ_ONCE(l->locked))
+ return;
/*
* The unlocker should have freed the lock before kicking the
* CPU. So if the lock is still not free, it is a spurious
* wakeup and so the vCPU should wait again after spinning for
* a while.
*/
+ pvstat_inc(pvstat_spurious);
}
/*
@@ -386,6 +555,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
*/
+ pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
}
/*
--
1.7.1
Most of the vCPU kicking is done on the locking side, where the new
lock holder wakes up the queue head vCPU to spin on the lock. However,
there are situations where it may be advantageous to defer the vCPU
kicking to when the lock holder releases the lock.
This patch enables the deferment of vCPU kicking to the unlock function
by adding a new vCPU state (vcpu_hashed) to mark the fact that
1) _Q_SLOW_VAL is set in the lock, and
2) the pv_node address is stored in the hash table
This enablement patch, by itself, should not change the performance
of the pvqspinlock code. Actual deferment of vCPU kicks will be added
in a later patch.
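For reference, the three vCPU states then mean roughly the following
(see also the enum comment in the diff below):

	vcpu_running - the vCPU is spinning (or about to spin) on the lock/node
	vcpu_halted  - the vCPU has called (or is about to call) pv_wait()
	vcpu_hashed  - vcpu_halted, plus _Q_SLOW_VAL is set and the pv_node
	               address is stored in the hash table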
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 6 +++---
kernel/locking/qspinlock_paravirt.h | 34 ++++++++++++++++++++++++----------
2 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6518ee9..94fdd27 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -259,8 +259,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }
@@ -464,7 +464,7 @@ queue:
cpu_relax();
arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_kick_node(lock, next);
release:
/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index db8d08b..a4986c6 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -33,6 +33,7 @@
enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
+ vcpu_hashed, /* vcpu_halted + node stored in hash table */
};
struct pv_node {
@@ -404,13 +405,17 @@ static void pv_wait_node(struct mcs_spinlock *node)
pv_wait(&pn->state, vcpu_halted);
}
+ if (READ_ONCE(node->locked))
+ break;
+
/*
- * Reset the vCPU state to avoid unncessary CPU kicking
+ * Reset the vCPU state to running to avoid unncessary CPU
+ * kicking unless vcpu_hashed had already been set. In this
+ * case, node->locked should have just been set, and we
+ * aren't going to set state to vcpu_halted again.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ cmpxchg(&pn->state, vcpu_halted, vcpu_running);
- if (READ_ONCE(node->locked))
- break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -429,12 +434,16 @@ static void pv_wait_node(struct mcs_spinlock *node)
/*
* Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * pv_wait_node(). Alternatively, it can also defer the kicking to the
+ * unlock function.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ if (xchg(&pn->state, vcpu_running) != vcpu_halted)
+ return;
+
/*
* Note that because node->locked is already set, this actual
* mcs_spinlock entry could be re-used already.
@@ -444,10 +453,8 @@ static void pv_kick_node(struct mcs_spinlock *node)
*
* See the comment in pv_wait_node().
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
- pvstat_inc(pvstat_lock_kick);
- pv_kick(pn->cpu);
- }
+ pvstat_inc(pvstat_lock_kick);
+ pv_kick(pn->cpu);
}
/*
@@ -469,6 +476,13 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}
+ if (!lp && (xchg(&pn->state, vcpu_hashed) == vcpu_hashed))
+ /*
+ * The hashed table & _Q_SLOW_VAL had been filled
+ * by the lock holder.
+ */
+ lp = (struct qspinlock **)-1;
+
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
--
1.7.1
Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
critical section and block forward progress. This patch implements
a kick-ahead mechanism where the unlocker will kick the queue head
vCPU as well as up to four additional vCPUs next to the queue head
if they were halted. The kicks are done after exiting the critical
section to improve parallelism.
The amount of kick-ahead allowed depends on the number of vCPUs
in the VM guest. Currently it allows up to 1 vCPU kick-ahead per
4 vCPUs available, up to a maximum of PV_KICK_AHEAD_MAX (4). There
are diminishing returns in increasing the maximum value. The current
value of 4 is a compromise that gets a nice performance boost without
penalizing the one vCPU that is doing all the kicking too much.
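As a rough sketch of that scaling rule (this mirrors the assignment done
in __pv_init_lock_hash() in the diff below, not new logic):

	pv_kick_ahead = min(num_possible_cpus() / 4, PV_KICK_AHEAD_MAX);

For example, a 48-vCPU guest gets min(48/4, 4) = 4 kick-aheads per
unlock, while an 8-vCPU guest gets min(8/4, 4) = 2.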
Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems were configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
Westmere Haswell
Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
----- -------- -------- -------- --------
Before patch 3m42.3s 10m27.5s 2m00.7s 17m22.2s
After patch 3m02.3s 9m35.9s 1m59.8s 16m57.6s
This improves performance quite substantially on Westmere, but not
so much on Haswell.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 108 ++++++++++++++++++++++++++++++++---
1 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index a4986c6..96c3833 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -54,6 +54,7 @@ enum pv_qlock_stat {
pvstat_kick_time,
pvstat_lock_kick,
pvstat_unlock_kick,
+ pvstat_kick_ahead,
pvstat_pend_lock,
pvstat_pend_fail,
pvstat_spurious,
@@ -75,6 +76,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_kick_time] = "kick_time_count",
[pvstat_lock_kick] = "lock_kick_count",
[pvstat_unlock_kick] = "unlock_kick_count",
+ [pvstat_kick_ahead] = "kick_ahead_count",
[pvstat_pend_lock] = "pending_lock_count",
[pvstat_pend_fail] = "pending_fail_count",
[pvstat_spurious] = "spurious_wakeup",
@@ -87,7 +89,8 @@ static atomic_t pvstats[pvstat_num];
* pv_kick_latencies = sum of all pv_kick latencies in ns
* pv_wake_latencies = sum of all wakeup latencies in ns
*
- * Avg kick latency = pv_kick_latencies/(lock_kick_count + unlock_kick_count)
+ * Avg kick latency = pv_kick_latencies/
+ * (lock_kick_count + unlock_kick_count + kick_ahead_count)
* Avg wake latency = pv_wake_latencies/kick_time_count
* Avg # of hops/hash = hash_hops_count/unlock_kick_count
*/
@@ -219,6 +222,18 @@ static struct pv_hash_entry *pv_lock_hash;
static unsigned int pv_lock_hash_bits __read_mostly;
/*
+ * Allow kick-ahead of vCPUs at unlock time
+ *
+ * The pv_kick_ahead value is set by a simple formula that 1 vCPU kick-ahead
+ * is allowed per 4 vCPUs available up to a maximum of PV_KICK_AHEAD_MAX.
+ * There are diminishing returns in increasing PV_KICK_AHEAD_MAX. The current
+ * value of 4 is a good compromise that gives a good performance boost without
+ * penalizing the vCPU that is doing the kicking by too much.
+ */
+#define PV_KICK_AHEAD_MAX 4
+static int pv_kick_ahead __read_mostly;
+
+/*
* Allocate memory for the PV qspinlock hash buckets
*
* This function should be called from the paravirt spinlock initialization
@@ -226,7 +241,8 @@ static unsigned int pv_lock_hash_bits __read_mostly;
*/
void __init __pv_init_lock_hash(void)
{
- int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
+ int ncpus = num_possible_cpus();
+ int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);
if (pv_hash_size < PV_HE_MIN)
pv_hash_size = PV_HE_MIN;
@@ -240,6 +256,13 @@ void __init __pv_init_lock_hash(void)
pv_hash_size, 0, HASH_EARLY,
&pv_lock_hash_bits, NULL,
pv_hash_size, pv_hash_size);
+ /*
+ * Enable the unlock kick ahead mode according to the number of
+ * vCPUs available.
+ */
+ pv_kick_ahead = min(ncpus/4, PV_KICK_AHEAD_MAX);
+ if (pv_kick_ahead)
+ pr_info("PV unlock kick ahead max count = %d\n", pv_kick_ahead);
}
#define for_each_hash_entry(he, offset, hash) \
@@ -433,6 +456,28 @@ static void pv_wait_node(struct mcs_spinlock *node)
}
/*
+ * Helper to get the address of the next kickable node
+ *
+ * The node has to be in the halted state. If the chkonly flag is set,
+ * the CPU state won't be changed. Otherwise, it will be transitioned to
+ * running state by this function. If no kickable node is found, NULL
+ * will be returned.
+ */
+static inline struct pv_node *
+pv_get_kick_node(struct pv_node *node, int chkonly)
+{
+ struct pv_node *next = (struct pv_node *)READ_ONCE(node->mcs.next);
+
+ if (!next || (READ_ONCE(next->state) != vcpu_halted))
+ return NULL;
+
+ if (!chkonly && (xchg(&next->state, vcpu_running) != vcpu_halted))
+ next = NULL; /* No kicking is needed */
+
+ return next;
+}
+
+/*
* Called after setting next->locked = 1, used to wake those stuck in
* pv_wait_node(). Alternatively, it can also defer the kicking to the
* unlock function.
@@ -445,14 +490,36 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
return;
/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
+ * Kicking the next node at lock time can actually be a bit faster
+ * than doing it at unlock time because the critical section time
+ * overlaps with the wakeup latency of the next node. However, if the
+ * VM is too overcommmitted, it can happen that we need to kick the
+ * CPU again at unlock time (double-kick). To avoid that and also to
+ * fully utilize the kick-ahead functionality at unlock time,
+ * the kicking will be deferred under either one of the following
+ * 2 conditions:
+ * 1) The VM guest has too few vCPUs that kick-ahead is not even
+ * enabled. In this case, the chance of double-kick will be
+ * higher.
+ * 2) The node after the next one is also in the halted state.
*
- * This should be fine however, kicking people for no reason is
- * harmless.
- *
- * See the comment in pv_wait_node().
+ * In this case, the state is set to vcpu_hased to indicate that hash
+ * table has been filled and _Q_SLOW_VAL is set.
*/
+ if (!pv_kick_ahead || pv_get_kick_node(pn, 1)) {
+ if (xchg(&pn->state, vcpu_hashed) != vcpu_hashed) {
+ struct __qspinlock *l = (void *)lock;
+
+ /*
+ * As this is the same vCPU that will check the
+ * _Q_SLOW_VAL value and the hash table later on at
+ * unlock time, no atomic instruction is needed.
+ */
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
+ }
+ return;
+ }
pvstat_inc(pvstat_lock_kick);
pv_kick(pn->cpu);
}
@@ -534,7 +601,8 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
struct __qspinlock *l = (void *)lock;
- struct pv_node *node;
+ struct pv_node *node, *next;
+ int i, nr_kick, cpus[PV_KICK_AHEAD_MAX];
u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
/*
@@ -560,6 +628,20 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
node = pv_unhash(lock);
/*
+ * Implement unlock kick-ahead
+ *
+ * Access the next group of nodes, if available, and prepare to kick
+ * them after releasing the lock if they are in the halted state. This
+ * should improve performance on an overcommitted system.
+ */
+ for (nr_kick = 0, next = node; nr_kick < pv_kick_ahead; nr_kick++) {
+ next = pv_get_kick_node(next, 0);
+ if (!next)
+ break;
+ cpus[nr_kick] = next->cpu;
+ }
+
+ /*
* Now that we have a reference to the (likely) blocked pv_node,
* release the lock.
*/
@@ -571,6 +653,14 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
*/
pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
+
+ /*
+ * Kick the next group of vCPUs, if available.
+ */
+ for (i = 0; i < nr_kick; i++) {
+ pvstat_inc(pvstat_kick_ahead);
+ pv_kick(cpus[i]);
+ }
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon, as the earlier ones have to be woken up and take
their turn to get the lock first.
Reducing the spinning threshold was found to improve performance in
an overcommitted VM guest, but to decrease performance when there is
no overcommitment.
This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() earlier if all the following conditions are true
(a condensed sketch of the decision follows the list):
1) the vCPU has not been halted before;
2) the previous vCPU is in the halted state;
3) the current vCPU is at least 2 nodes away from the lock holder;
4) there have been a lot of recent pv_wait() calls by the current vCPU.
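The condensed sketch, a paraphrase of pv_spin_threshold() in the diff
below (the real code also records the choice in pn->wait_early):

	/* Spin for the full threshold unless wait-early applies */
	if (waitcnt || pn->waithist <= PV_WAITHIST_THRESHOLD)
		loop = QNODE_SPIN_THRESHOLD;
	else if (!READ_ONCE(prev->mcs.locked) &&
		 (READ_ONCE(prev->state) != vcpu_running))
		loop = QNODE_SPIN_THRESHOLD_SHORT;	/* wait early */
	else
		loop = QNODE_SPIN_THRESHOLD;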
Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems were configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
Westmere Haswell
Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
----- -------- -------- -------- --------
Before patch 3m02.3s 9m35.9s 1m59.8s 16m57.6s
After patch 3m06.5s 9m38.7s 2m01.5s 9m42.3s
This patch seemed to cause a little bit of performance degradation
for 32 vCPUs. For 48 vCPUs, there wasn't much change for Westmere,
but there was a pretty big performance jump for Haswell.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 5 +-
kernel/locking/qspinlock_paravirt.h | 132 +++++++++++++++++++++++++++++++++-
2 files changed, 131 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 94fdd27..da39d43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -258,7 +258,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
*/
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
static __always_inline void __pv_kick_node(struct qspinlock *lock,
struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
@@ -415,7 +416,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
- pv_wait_node(node);
+ pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
}
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 96c3833..1861287 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,12 +23,47 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
/*
- * Queued Spinlock Spin Threshold
+ * Queued Spinlock Spin Thresholds
*
* The vCPU will spin a relatively short time in pending mode before falling
* back to queuing.
+ *
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will spin less if the following conditions are all true:
+ * 1) vCPU in the previous node is halted && it has not been halted before
+ * 2) it is at least 2 nodes away from the lock holder
+ * 3) there is a lot of pv_wait() in the curent vCPU recently
+ *
+ * The last condition is being monitored by the waithist field in the pv_node
+ * structure which tracks the history of pv_wait() relative to slowpath calls.
+ * Each pv_wait will increment this field by PV_WAIT_INC until it exceeds
+ * PV_WAITHIST_MAX. Each slowpath lock call will decrement it by 1 until it
+ * reaches PV_WAITHIST_MIN. If its value is higher than PV_WAITHIST_THRESHOLD,
+ * the vCPU will spin less. The reason for this adaptive spinning is to try
+ * to enable wait-early when overcommitted which should cause a lot more
+ * pv_wait's, but don't use it when it is not.
+ *
+ * The queue node vCPU will monitor the state of the previous node
+ * periodically to see if there is any change. If the previous node is
+ * found to be halted, it will call pv_wait() immediately when wait_early
+ * mode is enabled as the wakeup latency is pretty high. On the other, it
+ * won't go to the halted state immediately on entry to pv_wait_node() as
+ * the previous node may be being woken up.
+ *
+ * With PV_WAIT_INC set to 2, each pv_wait() while not in wait-early mode
+ * will increment waithist by 1. Each slowpath call without pv_wait() will
+ * decrement waithist by 1. The threshold is set in a way as to not prefer
+ * enabling wait-early.
*/
-#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_THRESHOLD SPIN_THRESHOLD
+#define QNODE_SPIN_THRESHOLD_SHORT (QNODE_SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_CHECK_MASK 0xff
+#define PV_WAIT_INC 2
+#define PV_WAITHIST_MIN 1
+#define PV_WAITHIST_MAX 20
+#define PV_WAITHIST_THRESHOLD 15
enum vcpu_state {
vcpu_running = 0,
@@ -42,6 +77,8 @@ struct pv_node {
int cpu;
u8 state;
+ u8 waithist;
+ u8 wait_early;
};
/*
@@ -51,6 +88,7 @@ enum pv_qlock_stat {
pvstat_wait_head,
pvstat_wait_node,
pvstat_wait_again,
+ pvstat_wait_early,
pvstat_kick_time,
pvstat_lock_kick,
pvstat_unlock_kick,
@@ -73,6 +111,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_head] = "wait_head_count",
[pvstat_wait_node] = "wait_node_count",
[pvstat_wait_again] = "wait_again_count",
+ [pvstat_wait_early] = "wait_early_count",
[pvstat_kick_time] = "kick_time_count",
[pvstat_lock_kick] = "lock_kick_count",
[pvstat_unlock_kick] = "unlock_kick_count",
@@ -321,6 +360,86 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
}
/*
+ * Helper functions to maintain the waithist field as well as returning the
+ * right spinning threshold and next spin count for queue nodes.
+ */
+static inline void pv_inc_waithist(struct pv_node *node)
+{
+ u8 wait_early = node->wait_early;
+
+ if (wait_early)
+ pvstat_inc(pvstat_wait_early);
+
+ /*
+ * pv_wait() in wait_early mode doesn't count as much as !wait_early
+ */
+ if (node->waithist < PV_WAITHIST_MAX)
+ node->waithist += wait_early ? 1 : PV_WAIT_INC;
+}
+
+static inline void pv_dec_waithist(struct pv_node *node)
+{
+ node->wait_early = 0;
+ if (node->waithist > PV_WAITHIST_MIN)
+ node->waithist--;
+}
+
+static inline int pv_spin_threshold(struct pv_node *node,
+ struct pv_node *prev, int waitcnt)
+{
+ node->wait_early = 0;
+
+ /*
+ * Don't wait early if node has been halted before or the waithist
+ * threshold has not been reached.
+ */
+ if (waitcnt || (node->waithist <= PV_WAITHIST_THRESHOLD))
+ return QNODE_SPIN_THRESHOLD;
+
+ /*
+ * Don't wait early if previous node is a queue head or is running
+ */
+ node->wait_early = !READ_ONCE(prev->mcs.locked) &&
+ (READ_ONCE(prev->state) != vcpu_running);
+
+ return node->wait_early ? QNODE_SPIN_THRESHOLD_SHORT
+ : QNODE_SPIN_THRESHOLD;
+}
+
+static inline int pv_next_spin_count(struct pv_node *node, struct pv_node *prev,
+ int waitcnt, int loop)
+{
+ int wait_early;
+
+ /*
+ * Don't need to make any check if
+ * 1) node has been halted before
+ * 2) it is not time to check yet
+ * 3) wait early is not enabled
+ */
+ if (waitcnt || ((loop & QNODE_SPIN_CHECK_MASK) != 1) ||
+ (node->waithist <= PV_WAITHIST_THRESHOLD))
+ return loop - 1;
+
+ /*
+ * Look for state transition at previous node.
+ *
+ * running => halted:
+ * call pv_wait() now
+ *
+ * halted => running:
+ * reset spin threshold to QNODE_SPIN_THRESHOLD
+ */
+ wait_early = !READ_ONCE(prev->mcs.locked) &&
+ (READ_ONCE(prev->state) != vcpu_running);
+ if (node->wait_early == wait_early)
+ return loop - 1;
+
+ node->wait_early = wait_early;
+ return wait_early ? 0 : QNODE_SPIN_THRESHOLD;
+}
+
+/*
* Initialize the PV part of the mcs_spinlock node.
*/
static void pv_init_node(struct mcs_spinlock *node)
@@ -397,14 +516,16 @@ gotlock:
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_kick_node() is used to wake the vcpu again.
*/
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct pv_node *pp = (struct pv_node *)prev;
int waitcnt = 0;
int loop;
for (;; waitcnt++) {
- for (loop = SPIN_THRESHOLD; loop; loop--) {
+ for (loop = pv_spin_threshold(pn, pp, waitcnt); loop;
+ loop = pv_next_spin_count(pn, pp, waitcnt, loop)) {
if (READ_ONCE(node->locked))
return;
cpu_relax();
@@ -422,6 +543,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
smp_store_mb(pn->state, vcpu_halted);
if (!READ_ONCE(node->locked)) {
+ pv_inc_waithist(pn);
pvstat_inc(pvstat_wait_node);
if (waitcnt)
pvstat_inc(pvstat_wait_again);
@@ -536,6 +658,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
int waitcnt = 0;
int loop;
+ pv_dec_waithist(pn); /* Pre-decremnt the waithist field */
for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -571,6 +694,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pv_inc_waithist(pn);
pvstat_inc(pvstat_wait_head);
if (waitcnt)
pvstat_inc(pvstat_wait_again);
--
1.7.1
The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.
For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().
This patch optimizes the PV unlock code path by:
1) Moving the unlock slowpath code from the fastpath into a separate
pv_queued_spin_unlock_slowpath() function to make the fastpath
as simple as possible.
2) For x86-64, hand-coding an assembly function to combine the register
saving thunk code with the fastpath code. Only registers that
are used in the fastpath will be saved and restored. If the
fastpath fails, the slowpath function will be called via another
PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
__pv_queued_spin_unlock() code as the thunk saves and restores
only one 32-bit register.
With a 5M lock/unlock loop microbenchmark, the table below shows
the execution times before and after the patch with different numbers
of threads in a VM running on a 32-core Westmere-EX box:
Threads Before patch After patch % Change
------- ------------ ----------- --------
1 154.0 ms 119.2 ms -23%
2 1272 ms 1174 ms -7.7%
3 3705 ms 3349 ms -9.6%
4 3767 ms 3597 ms -4.5%
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock_paravirt.h | 60 +++++++++++++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 31 ++++++++++-----
2 files changed, 81 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..46f0f82 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,66 @@
#ifndef __ASM_QSPINLOCK_PARAVIRT_H
#define __ASM_QSPINLOCK_PARAVIRT_H
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock __pv_queued_spin_unlock
+#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save_pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the registers saving trunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ * struct __qspinlock *l = (void *)lock;
+ * u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ * if (likely(lockval == _Q_LOCKED_VAL))
+ * return;
+ * pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ * rdi = lock (first argument)
+ * rsi = lockval (second argument)
+ * rdx = internal variable (set to 0)
+ */
+asm(".pushsection .text;"
+ ".globl " PV_UNLOCK ";"
+ ".align 4,0x90;"
+ PV_UNLOCK ": "
+ "push %rdx;"
+ "mov $0x1,%eax;"
+ "xor %edx,%edx;"
+ "lock cmpxchg %dl,(%rdi);"
+ "cmp $0x1,%al;"
+ "jne .slowpath;"
+ "pop %rdx;"
+ "ret;"
+ "nop;"
+ ".slowpath: "
+ "push %rsi;"
+ "movzbl %al,%esi;"
+ "call " PV_UNLOCK_SLOWPATH ";"
+ "pop %rsi;"
+ "pop %rdx;"
+ "ret;"
+ ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+ ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
+#endif /* CONFIG_64BIT */
#endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 1861287..56c3717 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -719,24 +719,21 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}
/*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV version of the unlock fastpath and slowpath functions to be used
+ * in stead of queued_spin_unlock().
*/
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
{
struct __qspinlock *l = (void *)lock;
struct pv_node *node, *next;
int i, nr_kick, cpus[PV_KICK_AHEAD_MAX];
- u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
/*
* We must not unlock if SLOW, because in that case we must first
* unhash. Otherwise it would be possible to have multiple @lock
* entries, which would be BAD.
*/
- if (likely(lockval == _Q_LOCKED_VAL))
- return;
-
if (unlikely(lockval != _Q_SLOW_VAL)) {
if (debug_locks_silent)
return;
@@ -786,12 +783,26 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
pv_kick(cpus[i]);
}
}
+
/*
* Include the architecture specific callee-save thunk of the
* __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
*/
#include <asm/qspinlock_paravirt.h>
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+ u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+
+ if (likely(lockval == _Q_LOCKED_VAL))
+ return;
+
+ pv_queued_spin_unlock_slowpath(lock, lockval);
+}
+#endif /* __pv_queued_spin_unlock */
--
1.7.1
On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
> The smp_store_release() is not a full barrier. In order to avoid missed
> wakeup, we may need to add memory barrier around locked and cpu state
> variables adding to complexity. As the chance of spurious wakeup is very
> low, it is easier and safer to just do an unconditional kick at unlock
> time.
Although I guess if SPIN_THRESHOLD is ever enlarged, the chances of
spurious wakeups would be greater.
> Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
> Using a locking microbenchmark on the same system, the locking
> rates in (kops/s) were:
>
> Threads Rate w/o patch Rate with patch
> ------- -------------- ---------------
> 2 (same socket) 6,515,265 7,077,476
> 2 (diff sockets) 2,967,145 4,353,851
At this level stddev would be nice to have, as these sort of things have
a lot of variation.
Out of general interest, are these microbenchmarks available? Would you
care to add something like that to perf-bench? I assume they would
probably be x86 specific.
Thanks,
Davidlohr
On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
> Like the native qspinlock, using the pending bit when it is lightly
> loaded to acquire the lock is faster than going through the PV queuing
> process which is even slower than the native queuing process. It also
> avoids loading two additional cachelines (the MCS and PV nodes).
>
> This patch adds the pending bit support for PV qspinlock. The pending
> bit code has a smaller spin threshold (1<<10). It will default back
> to the queuing method if it cannot acquired the lock within a certain
> time limit.
Can we infer that this new spin threshold is the metric to detect these
"light loads"? If so, I cannot help but wonder if there is some more
straightforward/ad-hoc way of detecting this, ie some pv_<> function.
That would also save a lot of time as it would not be time based.
Although it might be a more costly call altogether, I dunno.
Some comments about this 'loop' threshold.
> +static int pv_pending_lock(struct qspinlock *lock, u32 val)
> +{
> + int loop = PENDING_SPIN_THRESHOLD;
> + u32 new, old;
> +
> + /*
> + * wait for in-progress pending->locked hand-overs
> + */
> + if (val == _Q_PENDING_VAL) {
> + while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
> + loop--)
> + cpu_relax();
> + }
> +
> + /*
> + * trylock || pending
> + */
> + for (;;) {
> + if (val & ~_Q_LOCKED_MASK)
> + goto queue;
> + new = _Q_LOCKED_VAL;
> + if (val == new)
> + new |= _Q_PENDING_VAL;
> + old = atomic_cmpxchg(&lock->val, val, new);
> + if (old == val)
> + break;
> + if (loop-- <= 0)
> + goto queue;
> + }
So I'm not clear about the semantics of what (should) occurs when the
threshold is exhausted. In the trylock/pending loop above, you
immediately return 0, indicating we want to queue. Ok, but below:
> +
> + if (new == _Q_LOCKED_VAL)
> + goto gotlock;
> + /*
> + * We are pending, wait for the owner to go away.
> + */
> + while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
> + && (loop-- > 0))
> + cpu_relax();
> +
> + if (!(val & _Q_LOCKED_MASK)) {
> + clear_pending_set_locked(lock);
> + goto gotlock;
> + }
> + /*
> + * Clear the pending bit and fall back to queuing
> + */
> + clear_pending(lock);
... you call clear_pending before returning. Is this intentional? Smells
fishy.
And basically afaict all this chunk of code does is spin until loop is
exhausted, and break out when we get the lock. Ie, something like this is
a lot cleaner:
while (loop--) {
/*
* We are pending, wait for the owner to go away.
*/
val = smp_load_acquire(&lock->val.counter);
if (!(val & _Q_LOCKED_MASK)) {
clear_pending_set_locked(lock);
goto gotlock;
}
cpu_relax();
}
/*
* Clear the pending bit and fall back to queuing
*/
clear_pending(lock);
> +queue:
> + return 0;
> +
> +gotlock:
> + return 1;
> +}
> +
Thanks,
Davidlohr
On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
> This patch enables the accumulation of kicking and waiting related
> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
> option is selected. It also enables the collection of kicking and
> wakeup latencies which have a heavy dependency on the CPUs being used.
>
> The measured latencies for different CPUs are:
>
> CPU Wakeup Kicking
> --- ------ -------
> Haswell-EX 89.8us 7.4us
> Westmere-EX 67.6us 9.3us
>
But you don't mention anything about the overhead of enabling
QUEUED_LOCK_STAT. This does several atomic ops, thus potentially
thrashing workloads.
On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
> Patch 3 allows the collection of various count data that are useful
> to see what is happening in the system. They do add a bit of overhead
> when enabled slowing performance a tiny bit.
Ah ok you do mention the overhead. Hmmm but what's a tiny bit? A kernel
build comparison would be pretty nice to have a "usability" view.
Thanks,
Davidlohr
On Sat, 2015-07-25 at 15:31 -0700, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
> > The smp_store_release() is not a full barrier. In order to avoid missed
> > wakeup, we may need to add memory barrier around locked and cpu state
> > variables adding to complexity. As the chance of spurious wakeup is very
> > low, it is easier and safer to just do an unconditional kick at unlock
> > time.
>
> Although I guess if SPIN_THRESHOLD is ever enlarged, the chances of
> spurious wakeups would be greater.
>
> > Signed-off-by: Waiman Long <[email protected]>
>
> Reviewed-by: Davidlohr Bueso <[email protected]>
Thinking about this some more, as good practice, could you please add a
comment in the code explicitly mentioning the spurious wakeup side
effect? Perhaps something more generic for the entire kernel could even
be added as Documentation/spurious-wakeups.txt?
Thanks,
Davidlohr
On 07/26/2015 07:09 PM, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
>> Using a locking microbenchmark on the same system, the locking
>> rates in (kops/s) were:
>>
>> Threads Rate w/o patch Rate with patch
>> ------- -------------- ---------------
>> 2 (same socket) 6,515,265 7,077,476
>> 2 (diff sockets) 2,967,145 4,353,851
> At this level stddev would be nice to have, as these sort of things have
> a lot of variation.
>
> Out of general interest, are these microbenchmarks available? Would you
> care to add something like that to perf-bench? I assume they would
> probably be x86 specific.
Ingo has asked a similar question and he suggested doing a user-space
implementation in perf bench. I will take a further look at that when I
have time.
Cheers,
Longman
On 07/26/2015 08:56 PM, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
>> Like the native qspinlock, using the pending bit when it is lightly
>> loaded to acquire the lock is faster than going through the PV queuing
>> process which is even slower than the native queuing process. It also
>> avoids loading two additional cachelines (the MCS and PV nodes).
>>
>> This patch adds the pending bit support for PV qspinlock. The pending
>> bit code has a smaller spin threshold (1<<10). It will default back
>> to the queuing method if it cannot acquired the lock within a certain
>> time limit.
> Can we infer that this new spin threshold is the metric to detect these
> "light loads"? If so, I cannot help but wonder if there is some more
> straightforward/ad-hoc way of detecting this, ie some pv_<> function.
> That would also save a lot of time as it would not be time based.
> Although it might be a more costly call altogether, I dunno.
I used the term "light load" to refer to the condition that at most 2
competing threads are trying to acquire the lock. In that case, the
pending code will be used. Once there are 3 or more competing threads,
it will switch back to the regular queuing code. It is the same
mechanism used in the native code. The only difference is in the
addition of a loop counter to make sure that the thread won't spend too
much time on spinning.
> Some comments about this 'loop' threshold.
>
>> +static int pv_pending_lock(struct qspinlock *lock, u32 val)
>> +{
>> + int loop = PENDING_SPIN_THRESHOLD;
>> + u32 new, old;
>> +
>> + /*
>> + * wait for in-progress pending->locked hand-overs
>> + */
>> + if (val == _Q_PENDING_VAL) {
>> + while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
>> + loop--)
>> + cpu_relax();
>> + }
>> +
>> + /*
>> + * trylock || pending
>> + */
>> + for (;;) {
>> + if (val & ~_Q_LOCKED_MASK)
>> + goto queue;
>> + new = _Q_LOCKED_VAL;
>> + if (val == new)
>> + new |= _Q_PENDING_VAL;
>> + old = atomic_cmpxchg(&lock->val, val, new);
>> + if (old == val)
>> + break;
>> + if (loop-- <= 0)
>> + goto queue;
>> + }
> So I'm not clear about the semantics of what (should) occurs when the
> threshold is exhausted. In the trylock/pending loop above, you
> immediately return 0, indicating we want to queue. Ok, but below:
This is in the lock slowpath, so it can't return a lock failure.
>> +
>> + if (new == _Q_LOCKED_VAL)
>> + goto gotlock;
>> + /*
>> + * We are pending, wait for the owner to go away.
>> + */
>> + while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
>> + && (loop-- > 0))
>> + cpu_relax();
>> +
>> + if (!(val & _Q_LOCKED_MASK)) {
>> + clear_pending_set_locked(lock);
>> + goto gotlock;
>> + }
>> + /*
>> + * Clear the pending bit and fall back to queuing
>> + */
>> + clear_pending(lock);
> ... you call clear_pending before returning. Is this intentional? Smells
> fishy.
The pending bit acts as a 1-slot waiting queue. So if the vCPU needs to
fall back to regular queuing, it needs to clear the bit.
>
> And basically afaict all this chunk of code does is spin until loop is
> exhausted, and breakout when we got the lock. Ie, something like this is
> a lot cleaner:
>
> while (loop--) {
> /*
> * We are pending, wait for the owner to go away.
> */
> val = smp_load_acquire(&lock->val.counter);
> if (!(val & _Q_LOCKED_MASK)) {
> clear_pending_set_locked(lock);
> goto gotlock;
> }
>
> cpu_relax();
> }
>
> /*
> * Clear the pending bit and fall back to queuing
> */
> clear_pending(lock);
>
Yes, we could change the loop to that. I was just following the same
logic in the native code.
Cheers,
Longman
On 07/26/2015 09:14 PM, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
>> This patch enables the accumulation of kicking and waiting related
>> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
>> option is selected. It also enables the collection of kicking and
>> wakeup latencies which have a heavy dependency on the CPUs being used.
>>
>> The measured latencies for different CPUs are:
>>
>> CPU Wakeup Kicking
>> --- ------ -------
>> Haswell-EX 89.8us 7.4us
>> Westmere-EX 67.6us 9.3us
>>
> But you don't mention anything about the overhead of enabling
> QUEUED_LOCK_STAT. This does several atomic ops, thus potentially
> thrashing workloads.
>
Yes, QUEUED_LOCK_STAT will slow performance a bit. It is like enabling
LOCK_STAT, where you would expect some slowdown too. It is essentially a
debugging option to see what has actually happened in the system. It
should be turned off in a production system.
Cheers,
Longman
On 07/26/2015 09:18 PM, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
>> Patch 3 allows the collection of various count data that are useful
>> to see what is happening in the system. They do add a bit of overhead
>> when enabled slowing performance a tiny bit.
> Ah ok you do mention the overhead. Hmmm but what's a tiny bit? A kernel
> build comparison would be pretty nice to have a "usability" view.
>
> Thanks,
> Davidlohr
>
You will probably see a few seconds of slowdown for a kernel build that
took 3 mins. So it is about a few percent, and it varies a bit from run
to run.
Cheers,
Longman
On 07/26/2015 09:46 PM, Davidlohr Bueso wrote:
> On Sat, 2015-07-25 at 15:31 -0700, Davidlohr Bueso wrote:
>> On Wed, 2015-07-22 at 16:12 -0400, Waiman Long wrote:
>>> The smp_store_release() is not a full barrier. In order to avoid missed
>>> wakeup, we may need to add memory barrier around locked and cpu state
>>> variables adding to complexity. As the chance of spurious wakeup is very
>>> low, it is easier and safer to just do an unconditional kick at unlock
>>> time.
>> Although I guess if SPIN_THRESHOLD is ever enlarged, the chances of
>> spurious wakeups would be greater.
>>
>>> Signed-off-by: Waiman Long <[email protected]>
>> Reviewed-by: Davidlohr Bueso <[email protected]>
> Thinking about this some more, as good practice, could you please add a
> comment in the code explicitly mentioning the spurious wakeup side
> effect? Perhaps even having something more generic for the entire kernel
> might be added/created to Documentation/spurious-wakeups.txt?
>
> Thanks,
> Davidlohr
>
The if conditional check was added with the intention of saving an
unneeded PV kick when the vCPU was running. Doing an unconditional kick
doesn't do any harm other than the additional latency of the PV kick. I
will add a comment when I update the patch.
As for the spurious-wakeups.txt, I saw spurious wakeups occasionally.
However, I am not totally sure of the mechanism that causes them. Also,
the spurious wakeup here refers to the vmenter of the vCPU, which is
different from the spurious wakeup of a sleeping thread. I don't think I
have enough data and information to write a document file yet.
Cheers,
Longman
On Mon, 2015-07-27 at 13:50 -0400, Waiman Long wrote:
> As for the spurious-wakeups.txt, I saw there are spurious wakeup
> occasionally. However, I am not totally sure of the mechanism that
> causes it. Also the spurious wakeup here refers to the vmenter of the
> vCPU which is different from spurious wakeup of a sleeping thread. I
> don't think I have enough data and information to write a document file yet.
Right, this was a generic idea, not only regarding actual wakeups of
sleeping threads.
On Mon, 2015-07-27 at 13:30 -0400, Waiman Long wrote:
> The pending bit acts as a 1-slot waiting queue. So if the vCPU needs to
> fall back to regular queuing, it needs to clear the bit.
Right, that's what I thought. So you would also need to call
clear_pending() as soon as the trylock/pending loop hits zero. Or am I
missing something?
/*
* trylock || pending
*/
for (;;) {
...
> if (loop-- <= 0) {
clear_pending(lock);
goto queue;
}
}
Also, no need to check for a negative loop, zero should be enough to
break out.
Thanks,
Davidlohr
On Mon, 2015-07-27 at 13:30 -0400, Waiman Long wrote:
> Yes, we could change the loop to that. I was just following the same
> logic in the native code.
Well since the native code obviously doesn't involve the
threshold/clear_pending, it looks fine to me. It's just the pv version I
think should be updated.
Thanks,
Davidlohr
On 07/27/2015 03:39 PM, Davidlohr Bueso wrote:
> On Mon, 2015-07-27 at 13:30 -0400, Waiman Long wrote:
>> The pending bit acts as a 1-slot waiting queue. So if the vCPU needs to
>> fall back to regular queuing, it needs to clear the bit.
> Right, that's what I thought. So you would also need to call
> clear_pending() as soon as the trylock/pending loop hits zero. Or am I
> missing something?
>
> /*
> * trylock || pending
> */
> for (;;) {
> ...
> if (loop-- <= 0) {
> clear_pending(lock);
> goto queue;
> }
> }
>
> Also, no need to check for negative loop, zero should be enough to
> breakout.
The while loop below can potentially make the loop variable become negative:
/*
* wait for in-progress pending->locked hand-overs
*/
if (val == _Q_PENDING_VAL) {
while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
loop--)
cpu_relax();
}
Cheers,
Longman
On Wed, Jul 22, 2015 at 04:12:36PM -0400, Waiman Long wrote:
> The smp_store_release() is not a full barrier. In order to avoid missed
> wakeup, we may need to add memory barrier around locked and cpu state
> variables adding to complexity. As the chance of spurious wakeup is very
> low, it is easier and safer to just do an unconditional kick at unlock
> time.
This needs to better spell out the race; my still sleeping brain doesn't
want to co-operate and it's generally better to spell out these things
anyway.
On 07/31/2015 04:39 AM, Peter Zijlstra wrote:
> On Wed, Jul 22, 2015 at 04:12:36PM -0400, Waiman Long wrote:
>> The smp_store_release() is not a full barrier. In order to avoid missed
>> wakeup, we may need to add memory barrier around locked and cpu state
>> variables adding to complexity. As the chance of spurious wakeup is very
>> low, it is easier and safer to just do an unconditional kick at unlock
>> time.
> This needs to better spell out the race; my still sleeping brain doesn't
> want to co-operate and its generally better to spell out these things
> anyway.
Sure, I will add a comment to talk about the possible race.
Cheers,
Longman