v1->v2:
- Take out the queued unfair lock patches
- Add a patch to simplify the PV unlock code
- Move pending bit and statistics collection patches to the front
- Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
when appropriate.
- Change the wait-early patch to use adaptive spinning to better
  balance its differing effects on normal and over-committed guests.
- Add patch-to-patch performance changes in the patch commit logs.
This patchset tries to improve the performance of both normal and
over-committed VM guests. The kick-ahead and adaptive spinning
patches are inspired by the "Do Virtual Machines Really Scale?" blog
from Sanidhya Kashyap.
Patch 1 simplifies the unlock code by doing an unconditional vCPU kick
when _Q_SLOW_VAL is set, as the chance of a spurious wakeup showing
up in the statistical data that I collected was very low (an occasional
1 or 2).
Patch 2 adds pending bit support to pvqspinlock, improving performance
at light load.
Patch 3 allows the collection of various counts that are useful for
seeing what is happening in the system. They do add a bit of overhead
when enabled, slowing the system down slightly.
Patch 4 enables multiple vCPU kicks at unlock time, outside of the
critical section. Coupled with patch 5, which defers kicking to unlock
time, this will improve system performance in overcommitted guests
and sometimes even in normal guests.
Patch 6 enables adaptive spinning in the queue nodes. This patch can lead
to a pretty big performance increase in over-committed guests at the
expense of a slight performance hit in normal guests.
Waiman Long (6):
locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
locking/pvqspinlock: Add pending bit support
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock: Allow vCPUs kick-ahead
locking/pvqspinlock: Opportunistically defer kicking to unlock time
locking/pvqspinlock: Queue node adaptive spinning
arch/x86/Kconfig | 7 +
kernel/locking/qspinlock.c | 38 +++-
kernel/locking/qspinlock_paravirt.h | 452 +++++++++++++++++++++++++++++++++--
3 files changed, 473 insertions(+), 24 deletions(-)
The smp_store_release() is not a full barrier. In order to avoid a missed
wakeup, we may need to add memory barriers around the locked and cpu state
variables, adding to complexity. As the chance of a spurious wakeup is very
low, it is easier and safer to just do an unconditional kick at unlock
time.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 04ab181..f2f4807 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -239,7 +239,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}
- WRITE_ONCE(pn->state, vcpu_halted);
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
@@ -311,8 +310,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
- pv_kick(node->cpu);
+ pv_kick(node->cpu);
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1
Like the native qspinlock, using the pending bit to acquire the lock
when it is lightly contended is faster than going through the PV queuing
process, which is even slower than the native queuing process. It also
avoids loading two additional cachelines (the MCS and PV nodes).
This patch adds pending bit support for PV qspinlock. The pending
bit code has a smaller spin threshold (1<<10). It will fall back
to the queuing method if it cannot acquire the lock within a certain
time limit.
On a VM with 32 vCPUs on a 32-core Westmere-EX box, the kernel
build times on 4.2-rc1 based kernels were:
Kernel        Build Time    Sys Time
------        ----------    --------
w/o patch      3m28.5s      28m17.5s
with patch     3m19.3s      23m55.7s
Using a locking microbenchmark on the same system, the locking
rates (in kops/s) were:
Threads            Rate w/o patch   Rate with patch
-------            --------------   ---------------
2 (same socket)       6,515,265         7,077,476
2 (diff sockets)      2,967,145         4,353,851
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 27 ++++++++++++++-
kernel/locking/qspinlock_paravirt.h | 66 +++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+), 1 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..6518ee9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -162,6 +162,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->pending, 0);
+}
+
/*
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
@@ -193,6 +204,15 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
}
/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL, &lock->val);
+}
+
+/**
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
* @tail : The new queue tail code word
@@ -245,6 +265,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }
#define pv_enabled() false
+#define pv_pending_lock(l, v) false
#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
@@ -286,8 +307,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
- if (pv_enabled())
+ if (pv_enabled()) {
+ if (pv_pending_lock(lock, val))
+ return; /* Got the lock via pending bit */
goto queue;
+ }
if (virt_queued_spin_lock(lock))
return;
@@ -463,6 +487,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_wait_node
#undef pv_kick_node
#undef pv_wait_head
+#undef pv_pending_lock
#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f2f4807..c43dec7 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -21,6 +21,14 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+/*
+ * Queued Spinlock Spin Threshold
+ *
+ * The vCPU will spin a relatively short time in pending mode before falling
+ * back to queuing.
+ */
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+
enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
@@ -151,6 +159,64 @@ static void pv_init_node(struct mcs_spinlock *node)
}
/*
+ * Try to acquire the lock and wait using the pending bit
+ */
+static int pv_pending_lock(struct qspinlock *lock, u32 val)
+{
+ int loop = PENDING_SPIN_THRESHOLD;
+ u32 new, old;
+
+ /*
+ * wait for in-progress pending->locked hand-overs
+ */
+ if (val == _Q_PENDING_VAL) {
+ while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
+ loop--)
+ cpu_relax();
+ }
+
+ /*
+ * trylock || pending
+ */
+ for (;;) {
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+ if (loop-- <= 0)
+ goto queue;
+ }
+
+ if (new == _Q_LOCKED_VAL)
+ goto gotlock;
+ /*
+ * We are pending, wait for the owner to go away.
+ */
+ while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
+ && (loop-- > 0))
+ cpu_relax();
+
+ if (!(val & _Q_LOCKED_MASK)) {
+ clear_pending_set_locked(lock);
+ goto gotlock;
+ }
+ /*
+ * Clear the pending bit and fall back to queuing
+ */
+ clear_pending(lock);
+
+queue:
+ return 0;
+
+gotlock:
+ return 1;
+}
+
+/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_kick_node() is used to wake the vcpu again.
*/
--
1.7.1
This patch enables the accumulation of kicking- and waiting-related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of kicking and
wakeup latencies, which depend heavily on the CPUs being used.
The measured latencies for different CPUs are:
CPU             Wakeup     Kicking
---             ------     -------
Haswell-EX      89.8us      7.4us
Westmere-EX     67.6us      9.3us
The measured latencies varied a bit from run to run. The wakeup
latency is much higher than the kicking latency.
A sample of statistics counts after a kernel build (no CPU overcommit)
was:
hash_count=93
hash_hops_count=93
kick_latencies=6997535456
kick_time_count=755426
lock_kick_count=755354
pending_fail_count=10619
pending_lock_count=2912098
spurious_wakeup=8
unlock_kick_count=87
wait_head_count=91
wait_node_count=755359
wake_latencies=51052626984
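These counters show up as individual files under the pv-qspinlock
debugfs directory (typically /sys/kernel/debug/pv-qspinlock/) and can
be cleared by writing a non-zero value to reset_cnts; the actual reset
happens on the next counter increment. As a rough consistency check,
assuming the sample above was taken on the Westmere-EX box, the averages
derived from these counters line up with the latency table above:
  avg kick latency = kick_latencies/(lock_kick_count + unlock_kick_count)
                   = 6997535456/(755354 + 87) ns ~= 9.3us
  avg wake latency = wake_latencies/kick_time_count
                   = 51052626984/755426 ns ~= 67.6us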
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 7 ++
kernel/locking/qspinlock_paravirt.h | 165 ++++++++++++++++++++++++++++++++++-
2 files changed, 170 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..299a1c4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -679,6 +679,13 @@ config PARAVIRT_SPINLOCKS
If you are unsure how to answer this question, answer Y.
+config QUEUED_LOCK_STAT
+ bool "Paravirt queued lock statistics"
+ depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ ---help---
+ Enable the collection of statistical data on the behavior of
+ paravirtualized queued spinlocks and report them on debugfs.
+
source "arch/x86/xen/Kconfig"
config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index c43dec7..c8485c4 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -43,6 +43,153 @@ struct pv_node {
};
/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+ pvstat_wait_head,
+ pvstat_wait_node,
+ pvstat_kick_time,
+ pvstat_lock_kick,
+ pvstat_unlock_kick,
+ pvstat_pend_lock,
+ pvstat_pend_fail,
+ pvstat_spurious,
+ pvstat_hash,
+ pvstat_hops,
+ pvstat_num /* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+ [pvstat_wait_head] = "wait_head_count",
+ [pvstat_wait_node] = "wait_node_count",
+ [pvstat_kick_time] = "kick_time_count",
+ [pvstat_lock_kick] = "lock_kick_count",
+ [pvstat_unlock_kick] = "unlock_kick_count",
+ [pvstat_pend_lock] = "pending_lock_count",
+ [pvstat_pend_fail] = "pending_fail_count",
+ [pvstat_spurious] = "spurious_wakeup",
+ [pvstat_hash] = "hash_count",
+ [pvstat_hops] = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency = pv_kick_latencies/(lock_kick_count + unlock_kick_count)
+ * Avg wake latency = pv_wake_latencies/kick_time_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+ struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+ int i;
+
+ if (!d_pvqlock)
+ pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+ for (i = 0; i < pvstat_num; i++)
+ debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)&pvstats[i]);
+ debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_kick_latencies);
+ debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_wake_latencies);
+ debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+ int i;
+
+ for (i = 0; i < pvstat_num; i++)
+ atomic_set(&pvstats[i], 0);
+ atomic64_set(&pv_kick_latencies, 0);
+ atomic64_set(&pv_wake_latencies, 0);
+ reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+ atomic_inc(&pvstats[stat]);
+ if (unlikely(reset_cnts))
+ pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+ atomic_inc(&pvstats[pvstat_hash]);
+ atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+ u64 start = sched_clock();
+
+ *per_cpu_ptr(&pv_kick_time, cpu) = start;
+ pv_kick(cpu);
+ atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+ u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+ *pkick_time = 0;
+ pv_wait(ptr, val);
+ if (*pkick_time) {
+ atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+ pvstat_inc(pvstat_kick_time);
+ }
+}
+
+#define pv_kick(c) __pv_kick(c)
+#define pv_wait(p, v) __pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt) { }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
* Lock and MCS node addresses hash table for fast lookup
*
* Hashing is done on a per-cacheline basis to minimize the need to access
@@ -102,10 +249,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
+ int hopcnt = 0;
for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
+ pvstat_hop(hopcnt);
return &he->lock;
}
}
@@ -208,11 +358,13 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
* Clear the pending bit and fall back to queuing
*/
clear_pending(lock);
+ pvstat_inc(pvstat_pend_fail);
queue:
return 0;
gotlock:
+ pvstat_inc(pvstat_pend_lock);
return 1;
}
@@ -243,14 +395,18 @@ static void pv_wait_node(struct mcs_spinlock *node)
*/
smp_store_mb(pn->state, vcpu_halted);
- if (!READ_ONCE(node->locked))
+ if (!READ_ONCE(node->locked)) {
+ pvstat_inc(pvstat_wait_node);
pv_wait(&pn->state, vcpu_halted);
+ }
/*
* Reset the vCPU state to avoid unncessary CPU kicking
*/
WRITE_ONCE(pn->state, vcpu_running);
+ if (READ_ONCE(node->locked))
+ break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -258,6 +414,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
+ pvstat_inc(pvstat_spurious);
}
/*
* By now our node->locked should be 1 and our caller will not actually
@@ -283,8 +440,10 @@ static void pv_kick_node(struct mcs_spinlock *node)
*
* See the comment in pv_wait_node().
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
+ if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
+ pvstat_inc(pvstat_lock_kick);
pv_kick(pn->cpu);
+ }
}
/*
@@ -326,6 +485,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pvstat_inc(pvstat_wait_head);
pv_wait(&l->locked, _Q_SLOW_VAL);
/*
@@ -376,6 +536,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
*/
+ pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
}
/*
--
1.7.1
Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
critical section and block forward progress. This patch implements
a kick-ahead mechanism where the unlocker will kick the queue head
vCPU as well as up to four additional vCPUs next to the queue head
if they were halted. The kicks are done after exiting the critical
section to improve parallelism.
The amount of kick-ahead allowed depends on the number of vCPUs
in the VM guest. This patch, by itself, won't do much as most of
the kicks are currently done at lock time. Coupled with the next
patch, which defers lock-time kicking to unlock time, it should improve
overall system performance in a busy overcommitted guest.
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
                  Westmere             Haswell
Patch          32 vCPUs  48 vCPUs   32 vCPUs  48 vCPUs
-----          --------  --------   --------  --------
Before patch    3m25.0s  10m34.1s    2m02.0s  15m35.9s
After patch     3m27.4s  10m32.0s    2m00.8s  14m52.5s
There wasn't too much difference before and after the patch.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 77 +++++++++++++++++++++++++++++++++-
1 files changed, 74 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index c8485c4..f3ceeff 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -51,6 +51,7 @@ enum pv_qlock_stat {
pvstat_kick_time,
pvstat_lock_kick,
pvstat_unlock_kick,
+ pvstat_kick_ahead,
pvstat_pend_lock,
pvstat_pend_fail,
pvstat_spurious,
@@ -72,6 +73,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_kick_time] = "kick_time_count",
[pvstat_lock_kick] = "lock_kick_count",
[pvstat_unlock_kick] = "unlock_kick_count",
+ [pvstat_kick_ahead] = "kick_ahead_count",
[pvstat_pend_lock] = "pending_lock_count",
[pvstat_pend_fail] = "pending_fail_count",
[pvstat_spurious] = "spurious_wakeup",
@@ -85,7 +87,8 @@ static atomic_t pvstats[pvstat_num];
* pv_kick_latencies = sum of all pv_kick latencies in ns
* pv_wake_latencies = sum of all wakeup latencies in ns
*
- * Avg kick latency = pv_kick_latencies/(lock_kick_count + unlock_kick_count)
+ * Avg kick latency = pv_kick_latencies/
+ * (lock_kick_count + unlock_kick_count + kick_ahead_count)
* Avg wake latency = pv_wake_latencies/kick_time_count
*/
static atomic64_t pv_kick_latencies, pv_wake_latencies;
@@ -217,6 +220,12 @@ static struct pv_hash_entry *pv_lock_hash;
static unsigned int pv_lock_hash_bits __read_mostly;
/*
+ * Allow kick-ahead of vCPUs at unlock time
+ */
+#define PV_KICK_AHEAD_MAX 4
+static int pv_kick_ahead __read_mostly;
+
+/*
* Allocate memory for the PV qspinlock hash buckets
*
* This function should be called from the paravirt spinlock initialization
@@ -224,7 +233,16 @@ static unsigned int pv_lock_hash_bits __read_mostly;
*/
void __init __pv_init_lock_hash(void)
{
- int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
+ int ncpus = num_possible_cpus();
+ int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);
+ int i;
+
+ /*
+ * The minimum number of vCPUs required in each kick-ahead level
+ */
+ static const u8 kick_ahead_threshold[PV_KICK_AHEAD_MAX] = {
+ 4, 8, 16, 32
+ };
if (pv_hash_size < PV_HE_MIN)
pv_hash_size = PV_HE_MIN;
@@ -238,6 +256,18 @@ void __init __pv_init_lock_hash(void)
pv_hash_size, 0, HASH_EARLY,
&pv_lock_hash_bits, NULL,
pv_hash_size, pv_hash_size);
+ /*
+ * Enable the unlock kick ahead mode according to the number of
+ * vCPUs available.
+ */
+ for (i = PV_KICK_AHEAD_MAX; i > 0; i--)
+ if (ncpus >= kick_ahead_threshold[i - 1]) {
+ pv_kick_ahead = i;
+ break;
+ }
+ if (pv_kick_ahead)
+ pr_info("PV unlock kick ahead level %d enabled\n",
+ pv_kick_ahead);
}
#define for_each_hash_entry(he, offset, hash) \
@@ -424,6 +454,25 @@ static void pv_wait_node(struct mcs_spinlock *node)
}
/*
+ * Helper to get the address of the next kickable node
+ * The node has to be in the halted state and is being transitioned to
+ * running state by this function. Otherwise, NULL will be returned.
+ */
+static inline struct pv_node *pv_get_kick_node(struct pv_node *node)
+{
+ struct pv_node *next = (struct pv_node *)READ_ONCE(node->mcs.next);
+
+ if (!next)
+ return NULL;
+
+ if ((READ_ONCE(next->state) != vcpu_halted) ||
+ (xchg(&next->state, vcpu_running) != vcpu_halted))
+ next = NULL; /* No kicking is needed */
+
+ return next;
+}
+
+/*
* Called after setting next->locked = 1, used to wake those stuck in
* pv_wait_node().
*/
@@ -510,7 +559,8 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
struct __qspinlock *l = (void *)lock;
- struct pv_node *node;
+ struct pv_node *node, *nxt, *next[PV_KICK_AHEAD_MAX];
+ int i, nr_kick;
/*
* We must not unlock if SLOW, because in that case we must first
@@ -527,6 +577,19 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
node = pv_unhash(lock);
/*
+ * Implement unlock kick-ahead
+ *
+ * Access the next group of nodes, if available, and prepare to kick
+ * them after releasing the lock if they are in the halted state. This
+ * should improve performance on an overcommitted system.
+ */
+ for (nr_kick = 0, nxt = node; nr_kick < pv_kick_ahead; nr_kick++) {
+ nxt = next[nr_kick] = pv_get_kick_node(nxt);
+ if (!nxt)
+ break;
+ }
+
+ /*
* Now that we have a reference to the (likely) blocked pv_node,
* release the lock.
*/
@@ -538,6 +601,14 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
*/
pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
+
+ /*
+ * Kick the next group of vCPUs, if available.
+ */
+ for (i = 0; i < nr_kick; i++) {
+ pvstat_inc(pvstat_kick_ahead);
+ pv_kick(next[i]->cpu);
+ }
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1
Performing CPU kicking at lock time can be a bit faster if there
is no kick-ahead. On the other hand, deferring it to unlock time is
preferable when kick-ahead can be performed or when the VM guest has
so few vCPUs that a vCPU may be kicked twice before getting
the lock. This patch defers the kicking when either
one of the above 2 conditions is true.
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
                  Westmere             Haswell
Patch          32 vCPUs  48 vCPUs   32 vCPUs  48 vCPUs
-----          --------  --------   --------  --------
Before patch    3m27.4s  10m32.0s    2m00.8s  14m52.5s
After patch     3m01.3s   9m50.9s    2m00.5s  13m28.1s
On Westmere, both the 32 and 48 vCPU cases showed a sizeable increase
in performance. For Haswell, there was some improvement in the
overcommitted (48 vCPUs) case.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 6 +-
kernel/locking/qspinlock_paravirt.h | 71 +++++++++++++++++++++++++++--------
2 files changed, 58 insertions(+), 19 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6518ee9..94fdd27 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -259,8 +259,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }
@@ -464,7 +464,7 @@ queue:
cpu_relax();
arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_kick_node(lock, next);
release:
/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f3ceeff..1f0485c 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -40,6 +40,7 @@ struct pv_node {
int cpu;
u8 state;
+ u8 hashed; /* Set if in hashed table */
};
/*
@@ -336,6 +337,7 @@ static void pv_init_node(struct mcs_spinlock *node)
pn->cpu = smp_processor_id();
pn->state = vcpu_running;
+ pn->hashed = 0;
}
/*
@@ -455,18 +457,21 @@ static void pv_wait_node(struct mcs_spinlock *node)
/*
* Helper to get the address of the next kickable node
- * The node has to be in the halted state and is being transitioned to
- * running state by this function. Otherwise, NULL will be returned.
+ *
+ * The node has to be in the halted state. If the chkonly flag is set,
+ * the CPU state won't be changed. Otherwise, it will be transitioned to
+ * running state by this function. If no kickable node is found, NULL
+ * will be returned.
*/
-static inline struct pv_node *pv_get_kick_node(struct pv_node *node)
+static inline struct pv_node *
+pv_get_kick_node(struct pv_node *node, int chkonly)
{
struct pv_node *next = (struct pv_node *)READ_ONCE(node->mcs.next);
- if (!next)
+ if (!next || (READ_ONCE(next->state) != vcpu_halted))
return NULL;
- if ((READ_ONCE(next->state) != vcpu_halted) ||
- (xchg(&next->state, vcpu_running) != vcpu_halted))
+ if (!chkonly && (xchg(&next->state, vcpu_running) != vcpu_halted))
next = NULL; /* No kicking is needed */
return next;
@@ -476,23 +481,50 @@ static inline struct pv_node *pv_get_kick_node(struct pv_node *node)
* Called after setting next->locked = 1, used to wake those stuck in
* pv_wait_node().
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ if (xchg(&pn->state, vcpu_running) == vcpu_running)
+ return;
+
/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
+ * Kicking the next node at lock time can actually be a bit faster
+ * than doing it at unlock time because the critical section time
+ * overlaps with the wakeup latency of the next node. However, if the
+ * VM is too overcommmitted, it can happen that we need to kick the
+ * CPU again at unlock time (double-kick). To avoid that and also to
+ * fully utilize the kick-ahead functionality at unlock time,
+ * the kicking will be deferred under either one of the following
+ * 2 conditions:
*
- * This should be fine however, kicking people for no reason is
- * harmless.
+ * 1) The VM guest has too few vCPUs that kick-ahead is not even
+ * enabled. In this case, the chance of double-kick will be
+ * higher.
+ * 2) The node after the next one is also in the halted state.
*
- * See the comment in pv_wait_node().
+ * In this case, the hashed flag is set to indicate that hashed
+ * table has been filled and _Q_SLOW_VAL is set.
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
- pvstat_inc(pvstat_lock_kick);
- pv_kick(pn->cpu);
+ if ((!pv_kick_ahead || pv_get_kick_node(pn, 1)) &&
+ (xchg(&pn->hashed, 1) == 0)) {
+ struct __qspinlock *l = (void *)lock;
+
+ /*
+ * As this is the same vCPU that will check the _Q_SLOW_VAL
+ * value and the hash table later on at unlock time, no atomic
+ * instruction is needed.
+ */
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
+ return;
}
+
+ /*
+ * Kicking the vCPU even if it is not really halted is safe.
+ */
+ pvstat_inc(pvstat_lock_kick);
+ pv_kick(pn->cpu);
}
/*
@@ -513,6 +545,13 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}
+ if (!lp && (xchg(&pn->hashed, 1) == 1))
+ /*
+ * The hashed table & _Q_SLOW_VAL had been filled
+ * by the lock holder.
+ */
+ lp = (struct qspinlock **)-1;
+
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
@@ -584,7 +623,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* should improve performance on an overcommitted system.
*/
for (nr_kick = 0, nxt = node; nr_kick < pv_kick_ahead; nr_kick++) {
- nxt = next[nr_kick] = pv_get_kick_node(nxt);
+ nxt = next[nr_kick] = pv_get_kick_node(nxt, 0);
if (!nxt)
break;
}
--
1.7.1
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon; the earlier ones have to be woken up and take
their turn to get the lock first.
Reducing the spinning threshold is found to improve performance in
an overcommitted VM guest, but decrease performance when there is
no overcommitment.
This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() earlier under the following 3 conditions:
1) the previous vCPU is in the halted state;
2) the current vCPU is at least 2 nodes away from the lock holder;
3) there have been a lot of recent pv_wait() calls by the current vCPU.
If all the conditions are true, it will spin less before calling
pv_wait().
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
                  Westmere             Haswell
Patch          32 vCPUs  48 vCPUs   32 vCPUs  48 vCPUs
-----          --------  --------   --------  --------
Before patch    3m01.3s   9m50.9s    2m00.5s  13m28.1s
After patch     3m06.2s   9m38.0s    2m01.8s   9m06.9s
This patch seemed to cause a little bit of performance degradation
for 32 vCPUs. For 48 vCPUs, there was some performance gain for Westmere
and a big performance jump for Haswell.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 5 +-
kernel/locking/qspinlock_paravirt.h | 93 ++++++++++++++++++++++++++++++++--
2 files changed, 90 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 94fdd27..da39d43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -258,7 +258,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
*/
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
static __always_inline void __pv_kick_node(struct qspinlock *lock,
struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
@@ -415,7 +416,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
- pv_wait_node(node);
+ pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
}
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 1f0485c..5f97058 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -22,12 +22,37 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
/*
- * Queued Spinlock Spin Threshold
+ * Queued Spinlock Spin Thresholds
*
* The vCPU will spin a relatively short time in pending mode before falling
* back to queuing.
+ *
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will spin less if the following conditions are true:
+ * 1) vCPU in the previous node is halted
+ * 2) it is at least 2 nodes away from the lock holder
+ * 3) there is a lot of pv_wait() in the current vCPU recently
+ *
+ * The last condition is being monitored by the waithist field in the pv_node
+ * structure which tracks the history of pv_wait() relative to slowpath calls.
+ * Each pv_wait will increment this field by PV_WAIT_INC until it exceeds
+ * PV_WAITHIST_MAX. Each slowpath lock call will decrement it by 1 until it
+ * reaches PV_WAITHIST_MIN. If its value is higher than PV_WAITHIST_THRESHOLD,
+ * the vCPU will spin less. The reason for this adaptive spinning is to try
+ * to enable wait-early when overcommitted, but don't use it when it is not.
+ *
+ * The queue node vCPU will monitor the state of the previous node
+ * periodically to see if there is any change.
*/
-#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_THRESHOLD SPIN_THRESHOLD
+#define QNODE_SPIN_THRESHOLD_SHORT (QNODE_SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_CHECK_MASK 0xff
+#define PV_WAIT_INC 2
+#define PV_WAITHIST_MIN 1
+#define PV_WAITHIST_MAX 30
+#define PV_WAITHIST_THRESHOLD 15
enum vcpu_state {
vcpu_running = 0,
@@ -41,6 +66,7 @@ struct pv_node {
int cpu;
u8 state;
u8 hashed; /* Set if in hashed table */
+ u8 waithist;
};
/*
@@ -49,6 +75,7 @@ struct pv_node {
enum pv_qlock_stat {
pvstat_wait_head,
pvstat_wait_node,
+ pvstat_wait_early,
pvstat_kick_time,
pvstat_lock_kick,
pvstat_unlock_kick,
@@ -71,6 +98,7 @@ enum pv_qlock_stat {
static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_head] = "wait_head_count",
[pvstat_wait_node] = "wait_node_count",
+ [pvstat_wait_early] = "wait_early_count",
[pvstat_kick_time] = "kick_time_count",
[pvstat_lock_kick] = "lock_kick_count",
[pvstat_unlock_kick] = "unlock_kick_count",
@@ -402,19 +430,67 @@ gotlock:
/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_kick_node() is used to wake the vcpu again, but the kicking may also
+ * be deferred to the unlock time.
*/
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct pv_node *pp = (struct pv_node *)prev;
+ bool wait_early, can_wait_early;
int loop;
for (;;) {
- for (loop = SPIN_THRESHOLD; loop; loop--) {
+ /*
+ * Spin less if the previous vCPU was in the halted state
+ * and it is not the queue head.
+ */
+ can_wait_early = (pn->waithist > PV_WAITHIST_THRESHOLD);
+ wait_early = can_wait_early && !READ_ONCE(prev->locked) &&
+ (READ_ONCE(pp->state) == vcpu_halted);
+ loop = wait_early ? QNODE_SPIN_THRESHOLD_SHORT
+ : QNODE_SPIN_THRESHOLD;
+ for (; loop; loop--, cpu_relax()) {
+ bool halted;
+
if (READ_ONCE(node->locked))
return;
- cpu_relax();
+
+ if (!can_wait_early || (loop & QNODE_SPIN_CHECK_MASK))
+ continue;
+
+ /*
+ * Look for state transition at previous node.
+ *
+ * running => halted:
+ * call pv_wait() now if kick-ahead is enabled
+ * or reduce spin threshold to
+ * QNODE_SPIN_THRESHOLD_SHORT or less.
+ * halted => running:
+ * reset spin threshold to QNODE_SPIN_THRESHOLD
+ */
+ halted = (READ_ONCE(pp->state) == vcpu_halted) &&
+ !READ_ONCE(prev->locked);
+ if (wait_early == halted)
+ continue;
+ wait_early = halted;
+
+ if (!wait_early)
+ loop = QNODE_SPIN_THRESHOLD;
+ else if (pv_kick_ahead)
+ break;
+ else if (loop > QNODE_SPIN_THRESHOLD_SHORT)
+ loop = QNODE_SPIN_THRESHOLD_SHORT;
}
+ if (wait_early)
+ pvstat_inc(pvstat_wait_early);
+
+ /*
+ * A pv_wait while !wait_early has higher weight than when
+ * wait_early is true.
+ */
+ if (pn->waithist < PV_WAITHIST_MAX)
+ pn->waithist += wait_early ? 1 : PV_WAIT_INC;
/*
* Order pn->state vs pn->locked thusly:
@@ -538,6 +614,9 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
struct qspinlock **lp = NULL;
int loop;
+ if (pn->waithist > PV_WAITHIST_MIN)
+ pn->waithist--; /* Pre-decrement the waithist field */
+
for (;;) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -573,6 +652,8 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ if (pn->waithist < PV_WAITHIST_MAX)
+ pn->waithist += PV_WAIT_INC;
pvstat_inc(pvstat_wait_head);
pv_wait(&l->locked, _Q_SLOW_VAL);
--
1.7.1
On 07/15/2015 07:43 AM, Waiman Long wrote:
> Performing CPU kicking at lock time can be a bit faster if there
> is no kick-ahead. On the other hand, deferring it to unlock time is
> preferrable when kick-ahead can be performed or when the VM guest is
> having too few vCPUs that a vCPU may be kicked twice before getting
> the lock. This patch implements the deferring kicking when either
> one of the above 2 conditions is true.
>
> Linux kernel builds were run in KVM guest on an 8-socket, 4
> cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
> Haswell-EX system. Both systems are configured to have 32 physical
> CPUs. The kernel build times before and after the patch were:
>
> Westmere Haswell
> Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
> ----- -------- -------- -------- --------
> Before patch 3m27.4s 10m32.0s 2m00.8s 14m52.5s
> After patch 3m01.3s 9m50.9s 2m00.5s 13m28.1s
>
> On Westmere, both 32/48 vCPUs case showed some sizeable increase
> in performance. For Haswell, there was some improvement in the
> overcommitted (48 vCPUs) case.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
Hi Waiman,
For virtual guests, from my experiments, lock waiter preemption was the
main concern, especially with fair locks.
I find that this set of patches is in the right direction to address it.
Thanks for the patches.
On Tue, Jul 14, 2015 at 10:13:32PM -0400, Waiman Long wrote:
> The smp_store_release() is not a full barrier. In order to avoid missed
> wakeup, we may need to add memory barrier around locked and cpu state
> variables adding to complexity. As the chance of spurious wakeup is very
> low, it is easier and safer to just do an unconditional kick at unlock
> time.
I have the below patch. We need that rmb in there anyhow for the hash to
work.
---
Subject: locking/pvqspinlock: Order pv_unhash after cmpxchg on unlock slowpath
From: Will Deacon <[email protected]>
Date: Mon, 13 Jul 2015 16:58:30 +0100
When we unlock in __pv_queued_spin_unlock, a failed cmpxchg on the lock
value indicates that we need to take the slow-path and unhash the
corresponding node blocked on the lock.
Since a failed cmpxchg does not provide any memory-ordering guarantees,
it is possible that the node data could be read before the cmpxchg on
weakly-ordered architectures and therefore return a stale value, leading
to hash corruption and/or a BUG().
This patch adds an smp_rmb() following the failed cmpxchg operation, so
that the unhashing is ordered after the lock has been checked.
Cc: Paul McKenney <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Steve Capper <[email protected]>
Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
[peterz: More comments]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/locking/qspinlock_paravirt.h | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -244,13 +244,17 @@ static void pv_wait_head(struct qspinloc
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
- * lp must be set before setting _Q_SLOW_VAL
+ * We must hash before setting _Q_SLOW_VAL, such that
+ * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
+ * we'll be sure to be able to observe our hash entry.
*
- * [S] lp = lock [RmW] l = l->locked = 0
- * MB MB
- * [S] l->locked = _Q_SLOW_VAL [L] lp
+ * [S] pn->state
+ * [S] <hash> [Rmw] l->locked == _Q_SLOW_VAL
+ * MB RMB
+ * [RmW] l->locked = _Q_SLOW_VAL [L] <unhash>
+ * [L] pn->state
*
- * Matches the cmpxchg() in __pv_queued_spin_unlock().
+ * Matches the smp_rmb() in __pv_queued_spin_unlock().
*/
if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
/*
@@ -306,6 +310,15 @@ __visible void __pv_queued_spin_unlock(s
}
/*
+ * A failed cmpxchg doesn't provide any memory-ordering guarantees,
+ * so we need a barrier to order the read of the node data in
+ * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
+ *
+ * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
+ */
+ smp_rmb();
+
+ /*
* Since the above failed to release, this must be the SLOW path.
* Therefore start by looking up the blocked node and unhashing it.
*/
On Tue, Jul 14, 2015 at 10:13:35PM -0400, Waiman Long wrote:
> Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthens
> critical section and block forward progress. This patch implements
> a kick-ahead mechanism where the unlocker will kick the queue head
> vCPUs as well as up to four additional vCPUs next to the queue head
> if they were halted. The kickings are done after exiting the critical
> section to improve parallelism.
>
> The amount of kick-ahead allowed depends on the number of vCPUs
> in the VM guest. This patch, by itself, won't do much as most of
> the kickings are currently done at lock time. Coupled with the next
> patch that defers lock time kicking to unlock time, it should improve
> overall system performance in a busy overcommitted guest.
>
> Linux kernel builds were run in KVM guest on an 8-socket, 4
> cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
> Haswell-EX system. Both systems are configured to have 32 physical
> CPUs. The kernel build times before and after the patch were:
>
> Westmere Haswell
> Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
> ----- -------- -------- -------- --------
> Before patch 3m25.0s 10m34.1s 2m02.0s 15m35.9s
> After patch 3m27.4s 10m32.0s 2m00.8s 14m52.5s
>
> There wasn't too much difference before and after the patch.
That means either the patch isn't worth it, or, as you seem to imply, it's
in the wrong place in this series.
> @@ -224,7 +233,16 @@ static unsigned int pv_lock_hash_bits __read_mostly;
> */
> void __init __pv_init_lock_hash(void)
> {
> - int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
> + int ncpus = num_possible_cpus();
> + int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);
> + int i;
> +
> + /*
> + * The minimum number of vCPUs required in each kick-ahead level
> + */
> + static const u8 kick_ahead_threshold[PV_KICK_AHEAD_MAX] = {
> + 4, 8, 16, 32
> + };
You are aware we have ilog2(), right?
> + /*
> + * Enable the unlock kick ahead mode according to the number of
> + * vCPUs available.
> + */
> + for (i = PV_KICK_AHEAD_MAX; i > 0; i--)
> + if (ncpus >= kick_ahead_threshold[i - 1]) {
> + pv_kick_ahead = i;
> + break;
> + }
That's missing { }.
> + if (pv_kick_ahead)
> + pr_info("PV unlock kick ahead level %d enabled\n",
> + pv_kick_ahead);
Idem.
That said, I still really dislike this patch; it again seems a random
bunch of hacks.
You also do not offer any support for any of the magic numbers...
On Tue, Jul 14, 2015 at 10:13:37PM -0400, Waiman Long wrote:
> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> {
> struct pv_node *pn = (struct pv_node *)node;
> + struct pv_node *pp = (struct pv_node *)prev;
> + bool wait_early, can_wait_early;
> int loop;
>
> for (;;) {
> - for (loop = SPIN_THRESHOLD; loop; loop--) {
> + /*
> + * Spin less if the previous vCPU was in the halted state
> + * and it is not the queue head.
> + */
> + can_wait_early = (pn->waithist > PV_WAITHIST_THRESHOLD);
> + wait_early = can_wait_early && !READ_ONCE(prev->locked) &&
> + (READ_ONCE(pp->state) == vcpu_halted);
> + loop = wait_early ? QNODE_SPIN_THRESHOLD_SHORT
> + : QNODE_SPIN_THRESHOLD;
> + for (; loop; loop--, cpu_relax()) {
> + bool halted;
> +
> if (READ_ONCE(node->locked))
> return;
> - cpu_relax();
> +
> + if (!can_wait_early || (loop & QNODE_SPIN_CHECK_MASK))
> + continue;
> +
> + /*
> + * Look for state transition at previous node.
> + *
> + * running => halted:
> + * call pv_wait() now if kick-ahead is enabled
> + * or reduce spin threshold to
> + * QNODE_SPIN_THRESHOLD_SHORT or less.
> + * halted => running:
> + * reset spin threshold to QNODE_SPIN_THRESHOLD
> + */
> + halted = (READ_ONCE(pp->state) == vcpu_halted) &&
> + !READ_ONCE(prev->locked);
> + if (wait_early == halted)
> + continue;
> + wait_early = halted;
> +
> + if (!wait_early)
> + loop = QNODE_SPIN_THRESHOLD;
> + else if (pv_kick_ahead)
> + break;
> + else if (loop > QNODE_SPIN_THRESHOLD_SHORT)
> + loop = QNODE_SPIN_THRESHOLD_SHORT;
> }
> + if (wait_early)
> + pvstat_inc(pvstat_wait_early);
> +
> + /*
> + * A pv_wait while !wait_early has higher weight than when
> + * wait_early is true.
> + */
> + if (pn->waithist < PV_WAITHIST_MAX)
> + pn->waithist += wait_early ? 1 : PV_WAIT_INC;
So when you looked at this patch, you didn't go like, OMFG!?
So what was wrong with something like:
static inline int pv_spin_threshold(struct pv_node *prev)
{
if (READ_ONCE(prev->locked)) /* it can run, wait for it */
return SPIN_THRESHOLD;
if (READ_ONCE(prev->state) == vcpu_halted) /* its not running, do not wait */
return 1;
return SPIN_THRESHOLD;
}
static void pv_wait_head(...)
{
for (;;) {
for (loop = pv_spin_threshold(pp); loop; loop--) {
if (READ_ONCE(node->locked))
return;
cpu_relax();
}
if (!lp) {
...
}
pv_wait(&l->locked, _Q_SLOW_VAL);
}
}
What part of: "keep it simple" and "gradual complexity" have you still
not grasped?
On Tue, Jul 14, 2015 at 10:13:36PM -0400, Waiman Long wrote:
> +static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> {
> struct pv_node *pn = (struct pv_node *)node;
>
> + if (xchg(&pn->state, vcpu_running) == vcpu_running)
> + return;
> +
> /*
> + * Kicking the next node at lock time can actually be a bit faster
> + * than doing it at unlock time because the critical section time
> + * overlaps with the wakeup latency of the next node. However, if the
> + * VM is too overcommmitted, it can happen that we need to kick the
> + * CPU again at unlock time (double-kick). To avoid that and also to
> + * fully utilize the kick-ahead functionality at unlock time,
> + * the kicking will be deferred under either one of the following
> + * 2 conditions:
> *
> + * 1) The VM guest has too few vCPUs that kick-ahead is not even
> + * enabled. In this case, the chance of double-kick will be
> + * higher.
> + * 2) The node after the next one is also in the halted state.
> *
> + * In this case, the hashed flag is set to indicate that hashed
> + * table has been filled and _Q_SLOW_VAL is set.
> */
> - if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
> - pvstat_inc(pvstat_lock_kick);
> - pv_kick(pn->cpu);
> + if ((!pv_kick_ahead || pv_get_kick_node(pn, 1)) &&
> + (xchg(&pn->hashed, 1) == 0)) {
> + struct __qspinlock *l = (void *)lock;
> +
> + /*
> + * As this is the same vCPU that will check the _Q_SLOW_VAL
> + * value and the hash table later on at unlock time, no atomic
> + * instruction is needed.
> + */
> + WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> + (void)pv_hash(lock, pn);
> + return;
> }
> +
> + /*
> + * Kicking the vCPU even if it is not really halted is safe.
> + */
> + pvstat_inc(pvstat_lock_kick);
> + pv_kick(pn->cpu);
> }
>
> /*
> @@ -513,6 +545,13 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> cpu_relax();
> }
>
> + if (!lp && (xchg(&pn->hashed, 1) == 1))
> + /*
> + * The hashed table & _Q_SLOW_VAL had been filled
> + * by the lock holder.
> + */
> + lp = (struct qspinlock **)-1;
> +
> if (!lp) { /* ONCE */
> lp = pv_hash(lock, pn);
> /*
*groan*, so you complained the previous version of this patch was too
complex, but let me say I vastly preferred it to this one :/
On 07/15/2015 05:10 AM, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 10:13:32PM -0400, Waiman Long wrote:
>> The smp_store_release() is not a full barrier. In order to avoid missed
>> wakeup, we may need to add memory barrier around locked and cpu state
>> variables adding to complexity. As the chance of spurious wakeup is very
>> low, it is easier and safer to just do an unconditional kick at unlock
>> time.
> I have the below patch. We need that rmb in there anyhow for the hash to
> work.
>
> ---
> Subject: locking/pvqspinlock: Order pv_unhash after cmpxchg on unlock slowpath
> From: Will Deacon<[email protected]>
> Date: Mon, 13 Jul 2015 16:58:30 +0100
>
> When we unlock in __pv_queued_spin_unlock, a failed cmpxchg on the lock
> value indicates that we need to take the slow-path and unhash the
> corresponding node blocked on the lock.
>
> Since a failed cmpxchg does not provide any memory-ordering guarantees,
> it is possible that the node data could be read before the cmpxchg on
> weakly-ordered architectures and therefore return a stale value, leading
> to hash corruption and/or a BUG().
>
> This patch adds an smb_rmb() following the failed cmpxchg operation, so
> that the unhashing is ordered after the lock has been checked.
>
> Cc: Paul McKenney<[email protected]>
> Cc: Waiman Long<[email protected]>
> Cc: Steve Capper<[email protected]>
> Reported-by: Peter Zijlstra<[email protected]>
> Signed-off-by: Will Deacon<[email protected]>
> [peterz: More comments]
> Signed-off-by: Peter Zijlstra (Intel)<[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> kernel/locking/qspinlock_paravirt.h | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -244,13 +244,17 @@ static void pv_wait_head(struct qspinloc
> if (!lp) { /* ONCE */
> lp = pv_hash(lock, pn);
> /*
> - * lp must be set before setting _Q_SLOW_VAL
> + * We must hash before setting _Q_SLOW_VAL, such that
> + * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
> + * we'll be sure to be able to observe our hash entry.
> *
> - * [S] lp = lock [RmW] l = l->locked = 0
> - * MB MB
> - * [S] l->locked = _Q_SLOW_VAL [L] lp
> + * [S] pn->state
> + * [S]<hash> [Rmw] l->locked == _Q_SLOW_VAL
> + * MB RMB
> + * [RmW] l->locked = _Q_SLOW_VAL [L]<unhash>
> + * [L] pn->state
> *
> - * Matches the cmpxchg() in __pv_queued_spin_unlock().
> + * Matches the smp_rmb() in __pv_queued_spin_unlock().
> */
> if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
> /*
> @@ -306,6 +310,15 @@ __visible void __pv_queued_spin_unlock(s
> }
>
> /*
> + * A failed cmpxchg doesn't provide any memory-ordering guarantees,
> + * so we need a barrier to order the read of the node data in
> + * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> + *
> + * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
> + */
> + smp_rmb();
According to memory-barriers.txt, cmpxchg() is a full memory barrier. It
doesn't say that a failed cmpxchg loses its memory guarantee. So is the
documentation right? Or is that true for some architectures? I think it
is not true for x86.
Cheers,
Longman
On 07/15/2015 05:39 AM, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 10:13:35PM -0400, Waiman Long wrote:
>> Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthens
>> critical section and block forward progress. This patch implements
>> a kick-ahead mechanism where the unlocker will kick the queue head
>> vCPUs as well as up to four additional vCPUs next to the queue head
>> if they were halted. The kickings are done after exiting the critical
>> section to improve parallelism.
>>
>> The amount of kick-ahead allowed depends on the number of vCPUs
>> in the VM guest. This patch, by itself, won't do much as most of
>> the kickings are currently done at lock time. Coupled with the next
>> patch that defers lock time kicking to unlock time, it should improve
>> overall system performance in a busy overcommitted guest.
>>
>> Linux kernel builds were run in KVM guest on an 8-socket, 4
>> cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
>> Haswell-EX system. Both systems are configured to have 32 physical
>> CPUs. The kernel build times before and after the patch were:
>>
>> Westmere Haswell
>> Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
>> ----- -------- -------- -------- --------
>> Before patch 3m25.0s 10m34.1s 2m02.0s 15m35.9s
>> After patch 3m27.4s 10m32.0s 2m00.8s 14m52.5s
>>
>> There wasn't too much difference before and after the patch.
> That means either the patch isn't worth it, or as you seem to imply its
> in the wrong place in this series.
It needs to be coupled with the next patch to be effective, as most of
the kicking is currently happening on the lock side instead of the unlock
side. If you look at the sample pvqspinlock stats in patch 3:
lock_kick_count=755354
unlock_kick_count=87
The number of unlock kicks is negligible compared with the lock kicks.
Patch 5 does have a dependency on patch 4 unless we make it
unconditionally defer kicking to the unlock call, which was what I had
done in the v1 patch. The reason why I changed this in v2 is that I
found a very slight performance degradation in doing so.
>> @@ -224,7 +233,16 @@ static unsigned int pv_lock_hash_bits __read_mostly;
>> */
>> void __init __pv_init_lock_hash(void)
>> {
>> - int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
>> + int ncpus = num_possible_cpus();
>> + int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);
>> + int i;
>> +
>> + /*
>> + * The minimum number of vCPUs required in each kick-ahead level
>> + */
>> + static const u8 kick_ahead_threshold[PV_KICK_AHEAD_MAX] = {
>> + 4, 8, 16, 32
>> + };
> You are aware we have ilog2(), right?
Right, I was not aware of ilog2(). I am going to use that instead.
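Something along these lines, perhaps (untested sketch, keeping the
current minimum of 4 vCPUs before kick-ahead is enabled):
	if (ncpus >= 4)
		pv_kick_ahead = min(ilog2(ncpus) - 1, PV_KICK_AHEAD_MAX);
That gives the same levels as the current table (1/2/3/4 at 4/8/16/32
vCPUs) and caps at PV_KICK_AHEAD_MAX beyond that.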
>> + /*
>> + * Enable the unlock kick ahead mode according to the number of
>> + * vCPUs available.
>> + */
>> + for (i = PV_KICK_AHEAD_MAX; i> 0; i--)
>> + if (ncpus>= kick_ahead_threshold[i - 1]) {
>> + pv_kick_ahead = i;
>> + break;
>> + }
> That's missing { }.
>
>> + if (pv_kick_ahead)
>> + pr_info("PV unlock kick ahead level %d enabled\n",
>> + pv_kick_ahead);
> Idem.
Will fix the {} problems.
> That said, I still really dislike this patch, it again seems a random
> bunch of hacks.
Any suggestions to make it better suit your taste?
> You also do not offer any support for any of the magic numbers..
I chose 4 for PV_KICK_AHEAD_MAX as I didn't see much performance
difference when I did a kick-ahead of 5. Also, it may be too unfair to
the vCPU that is doing the kicking if the number is too big. The other
magic number is the pv_kick_ahead value itself. This one is kind of
arbitrary. Right now I do a log2, but it could be a divide by 4
(rshift 2) as well.
Cheers,
Longman
On 07/15/2015 06:01 AM, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 10:13:37PM -0400, Waiman Long wrote:
>> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>> {
>> struct pv_node *pn = (struct pv_node *)node;
>> + struct pv_node *pp = (struct pv_node *)prev;
>> + bool wait_early, can_wait_early;
>> int loop;
>>
>> for (;;) {
>> - for (loop = SPIN_THRESHOLD; loop; loop--) {
>> + /*
>> + * Spin less if the previous vCPU was in the halted state
>> + * and it is not the queue head.
>> + */
>> + can_wait_early = (pn->waithist> PV_WAITHIST_THRESHOLD);
>> + wait_early = can_wait_early&& !READ_ONCE(prev->locked)&&
>> + (READ_ONCE(pp->state) == vcpu_halted);
>> + loop = wait_early ? QNODE_SPIN_THRESHOLD_SHORT
>> + : QNODE_SPIN_THRESHOLD;
>> + for (; loop; loop--, cpu_relax()) {
>> + bool halted;
>> +
>> if (READ_ONCE(node->locked))
>> return;
>> - cpu_relax();
>> +
>> + if (!can_wait_early || (loop& QNODE_SPIN_CHECK_MASK))
>> + continue;
>> +
>> + /*
>> + * Look for state transition at previous node.
>> + *
>> + * running => halted:
>> + * call pv_wait() now if kick-ahead is enabled
>> + * or reduce spin threshold to
>> + * QNODE_SPIN_THRESHOLD_SHORT or less.
>> + * halted => running:
>> + * reset spin threshold to QNODE_SPIN_THRESHOLD
>> + */
>> + halted = (READ_ONCE(pp->state) == vcpu_halted)&&
>> + !READ_ONCE(prev->locked);
>> + if (wait_early == halted)
>> + continue;
>> + wait_early = halted;
>> +
>> + if (!wait_early)
>> + loop = QNODE_SPIN_THRESHOLD;
>> + else if (pv_kick_ahead)
>> + break;
>> + else if (loop> QNODE_SPIN_THRESHOLD_SHORT)
>> + loop = QNODE_SPIN_THRESHOLD_SHORT;
>> }
>> + if (wait_early)
>> + pvstat_inc(pvstat_wait_early);
>> +
>> + /*
>> + * A pv_wait while !wait_early has higher weight than when
>> + * wait_early is true.
>> + */
>> + if (pn->waithist< PV_WAITHIST_MAX)
>> + pn->waithist += wait_early ? 1 : PV_WAIT_INC;
> So when you looked at this patch, you didn't go like, OMFG!?
>
> So what was wrong with something like:
>
> static inline int pv_spin_threshold(struct pv_node *prev)
> {
> if (READ_ONCE(prev->locked)) /* it can run, wait for it */
> return SPIN_THRESHOLD;
>
> if (READ_ONCE(prev->state) == vcpu_halted) /* its not running, do not wait */
> return 1;
>
> return SPIN_THRESHOLD;
> }
>
> static void pv_wait_head(...)
> {
> for (;;) {
> for (loop = pv_spin_threshold(pp); loop; loop--) {
> if (READ_ONCE(node->locked))
> return;
> cpu_relax();
> }
>
> if (!lp) {
> ...
> }
> pv_wait(&l->locked, _Q_SLOW_VAL);
> }
> }
>
> What part of: "keep it simple" and "gradual complexity" have you still
> not grasped?
I confess that I was a bit sloppy in that part of the code. I wanted to
get it out for review ASAP without doing too much fine-tuning, as I
expect at least a few more iterations of this patchset. I will certainly
change it in the new patch. Anyway, thanks for the great suggestion.
Cheers,
Longman
On 07/15/2015 06:03 AM, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 10:13:36PM -0400, Waiman Long wrote:
>> +static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>> {
>> struct pv_node *pn = (struct pv_node *)node;
>>
>> + if (xchg(&pn->state, vcpu_running) == vcpu_running)
>> + return;
>> +
>> /*
>> + * Kicking the next node at lock time can actually be a bit faster
>> + * than doing it at unlock time because the critical section time
>> + * overlaps with the wakeup latency of the next node. However, if the
>> + * VM is too overcommitted, it can happen that we need to kick the
>> + * CPU again at unlock time (double-kick). To avoid that and also to
>> + * fully utilize the kick-ahead functionality at unlock time,
>> + * the kicking will be deferred under either one of the following
>> + * 2 conditions:
>> *
>> + * 1) The VM guest has so few vCPUs that kick-ahead is not even
>> + * enabled. In this case, the chance of double-kick will be
>> + * higher.
>> + * 2) The node after the next one is also in the halted state.
>> *
>> + * In this case, the hashed flag is set to indicate that the hash
>> + * table has been filled and _Q_SLOW_VAL is set.
>> */
>> - if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
>> - pvstat_inc(pvstat_lock_kick);
>> - pv_kick(pn->cpu);
>> + if ((!pv_kick_ahead || pv_get_kick_node(pn, 1)) &&
>> + (xchg(&pn->hashed, 1) == 0)) {
>> + struct __qspinlock *l = (void *)lock;
>> +
>> + /*
>> + * As this is the same vCPU that will check the _Q_SLOW_VAL
>> + * value and the hash table later on at unlock time, no atomic
>> + * instruction is needed.
>> + */
>> + WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>> + (void)pv_hash(lock, pn);
>> + return;
>> }
>> +
>> + /*
>> + * Kicking the vCPU even if it is not really halted is safe.
>> + */
>> + pvstat_inc(pvstat_lock_kick);
>> + pv_kick(pn->cpu);
>> }
>>
>> /*
>> @@ -513,6 +545,13 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> cpu_relax();
>> }
>>
>> + if (!lp && (xchg(&pn->hashed, 1) == 1))
>> + /*
>> + * The hash table & _Q_SLOW_VAL had been filled
>> + * by the lock holder.
>> + */
>> + lp = (struct qspinlock **)-1;
>> +
>> if (!lp) { /* ONCE */
>> lp = pv_hash(lock, pn);
>> /*
> *groan*, so you complained the previous version of this patch was too
> complex, but let me say I vastly preferred it to this one :/
I said it was complex because maintaining a tri-state variable needs more
thought than two bi-state variables. I can revert to the tri-state
variable, since doing an unconditional kick at unlock time simplifies the
code at pv_wait_head().
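As a rough sketch of that tri-state direction (assuming a third
vcpu_hashed value is added to the state enum; this illustrates the idea
being discussed, it is not the actual patch), pv_kick_node() could
collapse to something like:

static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
	struct pv_node *pn = (struct pv_node *)node;
	struct __qspinlock *l = (void *)lock;

	/*
	 * Transition halted -> hashed; a node that never halted needs
	 * no attention, and the separate pn->hashed flag goes away.
	 */
	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
		return;

	/*
	 * Fill the hash table and set _Q_SLOW_VAL here, so the
	 * (now unconditional) kick can be done at unlock time.
	 */
	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
	(void)pv_hash(lock, pn);
}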
Cheers,
Longman
On Wed, Jul 15, 2015 at 08:18:23PM -0400, Waiman Long wrote:
> On 07/15/2015 05:10 AM, Peter Zijlstra wrote:
> > /*
> >+ * A failed cmpxchg doesn't provide any memory-ordering guarantees,
> >+ * so we need a barrier to order the read of the node data in
> >+ * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> >+ *
> >+ * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
> >+ */
> >+ smp_rmb();
>
> According to memory_barriers.txt, cmpxchg() is a full memory barrier. It
> didn't say a failed cmpxchg will lose its memory guarantee. So is the
> documentation right?
The documentation is not entirely clear on this; but there are hints
that this is so.
> Or is that true for some architectures? I think it is
> not true for x86.
On x86 LOCK CMPXCHG is always a sync point, but yes there are archs for
which a failed cmpxchg does _NOT_ provide any barrier semantics.
The reason I started looking was because Will made Argh64 one of those.
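For reference, the unlock slow path under discussion has roughly the
following shape (a simplified sketch built around the quoted hunk, not a
verbatim copy of the patch), with the smp_rmb() sitting on the
failed-cmpxchg path:

__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;
	struct pv_node *node;

	/* Fast path: nobody set _Q_SLOW_VAL, just release the lock. */
	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
		return;

	/*
	 * The cmpxchg() above failed, and on architectures such as
	 * arm64 a failed cmpxchg provides no ordering, hence the
	 * explicit read barrier before looking at the hash table.
	 */
	smp_rmb();

	node = pv_unhash(lock);

	/* Release the lock, then kick the queue head vCPU. */
	smp_store_release(&l->locked, 0);
	pv_kick(node->cpu);
}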
On Wed, Jul 15, 2015 at 10:01:02PM -0400, Waiman Long wrote:
> On 07/15/2015 05:39 AM, Peter Zijlstra wrote:
> >On Tue, Jul 14, 2015 at 10:13:35PM -0400, Waiman Long wrote:
> >>Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
> >>critical section and block forward progress. This patch implements
> >>a kick-ahead mechanism where the unlocker will kick the queue head
> >>vCPUs as well as up to four additional vCPUs next to the queue head
> >>if they were halted. The kickings are done after exiting the critical
> >>section to improve parallelism.
> >>
> >>The amount of kick-ahead allowed depends on the number of vCPUs
> >>in the VM guest. This patch, by itself, won't do much as most of
> >>the kickings are currently done at lock time. Coupled with the next
> >>patch that defers lock time kicking to unlock time, it should improve
> >>overall system performance in a busy overcommitted guest.
> >>
> >>Linux kernel builds were run in KVM guest on an 8-socket, 4
> >>cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
> >>Haswell-EX system. Both systems are configured to have 32 physical
> >>CPUs. The kernel build times before and after the patch were:
> >>
> >> Westmere Haswell
> >> Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
> >> ----- -------- -------- -------- --------
> >> Before patch 3m25.0s 10m34.1s 2m02.0s 15m35.9s
> >> After patch 3m27.4s 10m32.0s 2m00.8s 14m52.5s
> >>
> >>There wasn't too much difference before and after the patch.
> >That means either the patch isn't worth it, or as you seem to imply its
> >in the wrong place in this series.
>
> It needs to be coupled with the next patch to be effective as most of the
> kicking is happening at the lock side, instead of at the unlock side. If
> you look at the sample pvqspinlock stats in patch 3:
>
> lock_kick_count=755354
> unlock_kick_count=87
>
> The number of unlock kicks is negligible compared with the lock kicks. Patch
> 5 does have a dependency on patch 4 unless we make it unconditionally defer
> kicking to the unlock call, which was what I had done in the v1 patch. The
> reason why I changed this in v2 is that I found a very slight performance
> degradation in doing so.
This way we cannot see the gains of the proposed complexity. So put it
in a place where you can.
> >You also do not offer any support for any of the magic numbers..
>
> I chose 4 for PV_KICK_AHEAD_MAX as I didn't see much performance difference
> when I did a kick-ahead of 5. Also, it may be too unfair to the vCPU that
> was doing the kicking if the number is too big. The other magic number is
> pv_kick_ahead itself, which is kind of arbitrary. Right now I do a log2,
> but it can be divided by 4 (rshift 2) as well.
So what was the difference between 1-2-3-4 ? I would be thinking one
extra kick is the biggest help, no?
On Wed, Jul 15, 2015 at 10:18:35PM -0400, Waiman Long wrote:
> On 07/15/2015 06:03 AM, Peter Zijlstra wrote:
> >*groan*, so you complained the previous version of this patch was too
> >complex, but let me say I vastly preferred it to this one :/
>
> I said it was complex because maintaining a tri-state variable needs more
> thought than two bi-state variables. I can revert to the tri-state
> variable, since doing an unconditional kick at unlock time simplifies the
> code at pv_wait_head().
Well, your state space isn't shrunk, you just use more variables and I'm
not entirely sure that actually matters.
What also doesn't help is mixing this in with the kicking code.
On 07/16/2015 01:42 AM, Peter Zijlstra wrote:
> On Wed, Jul 15, 2015 at 08:18:23PM -0400, Waiman Long wrote:
>> On 07/15/2015 05:10 AM, Peter Zijlstra wrote:
>>> /*
>>> + * A failed cmpxchg doesn't provide any memory-ordering guarantees,
>>> + * so we need a barrier to order the read of the node data in
>>> + * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
>>> + *
>>> + * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
>>> + */
>>> + smp_rmb();
>> According to memory_barriers.txt, cmpxchg() is a full memory barrier. It
>> didn't say a failed cmpxchg will lose its memory guarantee. So is the
>> documentation right?
> The documentation is not entirely clear on this; but there are hints
> that this is so.
>
>> Or is that true for some architectures? I think it is
>> not true for x86.
> On x86 LOCK CMPXCHG is always a sync point, but yes there are archs for
> which a failed cmpxchg does _NOT_ provide any barrier semantics.
>
> The reason I started looking was because Will made Argh64 one of those.
That is what I suspected. In that case, I am fine with the patch as
smp_rmb() is a no-op on x86 anyway.
Acked-by: Waiman Long <[email protected]>
BTW, I think we also need to update the documentation to make it clear
that a failed cmpxchg() or atomic_cmpxchg() may not be a full memory
barrier as most people may not be aware of that.
Cheers,
Longman
On 07/16/2015 01:46 AM, Peter Zijlstra wrote:
> On Wed, Jul 15, 2015 at 10:01:02PM -0400, Waiman Long wrote:
>> On 07/15/2015 05:39 AM, Peter Zijlstra wrote:
>>> On Tue, Jul 14, 2015 at 10:13:35PM -0400, Waiman Long wrote:
>>>> Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
>>>> critical section and block forward progress. This patch implements
>>>> a kick-ahead mechanism where the unlocker will kick the queue head
>>>> vCPUs as well as up to four additional vCPUs next to the queue head
>>>> if they were halted. The kickings are done after exiting the critical
>>>> section to improve parallelism.
>>>>
>>>> The amount of kick-ahead allowed depends on the number of vCPUs
>>>> in the VM guest. This patch, by itself, won't do much as most of
>>>> the kickings are currently done at lock time. Coupled with the next
>>>> patch that defers lock time kicking to unlock time, it should improve
>>>> overall system performance in a busy overcommitted guest.
>>>>
>>>> Linux kernel builds were run in KVM guest on an 8-socket, 4
>>>> cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
>>>> Haswell-EX system. Both systems are configured to have 32 physical
>>>> CPUs. The kernel build times before and after the patch were:
>>>>
>>>> Westmere Haswell
>>>> Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
>>>> ----- -------- -------- -------- --------
>>>> Before patch 3m25.0s 10m34.1s 2m02.0s 15m35.9s
>>>> After patch 3m27.4s 10m32.0s 2m00.8s 14m52.5s
>>>>
>>>> There wasn't too much difference before and after the patch.
>>> That means either the patch isn't worth it, or as you seem to imply its
>>> in the wrong place in this series.
>> It needs to be coupled with the next patch to be effective as most of the
>> kicking is happening at the lock side, instead of at the unlock side. If
>> you look at the sample pvqspinlock stats in patch 3:
>>
>> lock_kick_count=755354
>> unlock_kick_count=87
>>
>> The number of unlock kicks is negligible compared with the lock kicks. Patch
>> 5 does have a dependency on patch 4 unless we make it unconditionally defer
>> kicking to the unlock call, which was what I had done in the v1 patch. The
>> reason why I changed this in v2 is that I found a very slight performance
>> degradation in doing so.
> This way we cannot see the gains of the proposed complexity. So put it
> in a place where you can.
OK, I will see what I can do to make the performance change more visible
on a patch-by-patch basis.
>>> You also do not offer any support for any of the magic numbers..
>> I chose 4 for PV_KICK_AHEAD_MAX as I didn't see much performance difference
>> when I did a kick-ahead of 5. Also, it may be too unfair to the vCPU that
>> was doing the kicking if the number is too big. The other magic number is
>> pv_kick_ahead itself, which is kind of arbitrary. Right now I do a log2,
>> but it can be divided by 4 (rshift 2) as well.
> So what was the difference between 1-2-3-4 ? I would be thinking one
> extra kick is the biggest help, no?
I was seeing diminishing returns with more kicks. I can add a table on
that in the next patch.
Cheers,
Longman
On 07/16/2015 10:07 AM, Waiman Long wrote:
> On 07/16/2015 01:42 AM, Peter Zijlstra wrote:
>> On Wed, Jul 15, 2015 at 08:18:23PM -0400, Waiman Long wrote:
>>> On 07/15/2015 05:10 AM, Peter Zijlstra wrote:
>>>> /*
>>>> + * A failed cmpxchg doesn't provide any memory-ordering
>>>> guarantees,
>>>> + * so we need a barrier to order the read of the node data in
>>>> + * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
>>>> + *
>>>> + * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
>>>> + */
>>>> + smp_rmb();
>>> According to memory_barriers.txt, cmpxchg() is a full memory
>>> barrier. It
>>> didn't say a failed cmpxchg will lose its memory guarantee. So is the
>>> documentation right?
>> The documentation is not entirely clear on this; but there are hints
>> that this is so.
>>
>>> Or is that true for some architectures? I think it is
>>> not true for x86.
>> On x86 LOCK CMPXCHG is always a sync point, but yes there are archs for
>> which a failed cmpxchg does _NOT_ provide any barrier semantics.
>>
>> The reason I started looking was because Will made Argh64 one of those.
>
> That is what I suspected. In that case, I am fine with the patch as
> smp_rmb() is a no-op on x86 anyway.
>
> Acked-by: Waiman Long <[email protected]>
>
> BTW, I think we also need to update the documentation to make it clear
> that a failed cmpxchg() or atomic_cmpxchg() may not be a full memory
> barrier as most people may not be aware of that.
>
I suspect that there may be other places in the kernel that have a similar
problem. An alternative would be to strengthen cmpxchg() in architectures
like arm64 to act as a full memory barrier regardless of the result, and
to define a relaxed version with no such guarantee.
Cheers,
Longman
On Thu, Jul 16, 2015 at 04:04:30PM +0100, Waiman Long wrote:
> On 07/16/2015 10:07 AM, Waiman Long wrote:
> > On 07/16/2015 01:42 AM, Peter Zijlstra wrote:
> >> On Wed, Jul 15, 2015 at 08:18:23PM -0400, Waiman Long wrote:
> >>> On 07/15/2015 05:10 AM, Peter Zijlstra wrote:
> >>>> /*
> >>>> + * A failed cmpxchg doesn't provide any memory-ordering
> >>>> guarantees,
> >>>> + * so we need a barrier to order the read of the node data in
> >>>> + * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> >>>> + *
> >>>> + * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
> >>>> + */
> >>>> + smp_rmb();
> >>> According to memory_barriers.txt, cmpxchg() is a full memory
> >>> barrier. It
> >>> didn't say a failed cmpxchg will lose its memory guarantee. So is the
> >>> documentation right?
> >> The documentation is not entirely clear on this; but there are hints
> >> that this is so.
> >>
> >>> Or is that true for some architectures? I think it is
> >>> not true for x86.
> >> On x86 LOCK CMPXCHG is always a sync point, but yes there are archs for
> >> which a failed cmpxchg does _NOT_ provide any barrier semantics.
> >>
> >> The reason I started looking was because Will made Argh64 one of those.
> >
> > That is what I suspected. In that case, I am fine with the patch as
> > smp_rmb() is a no-op on x86 anyway.
> >
> > Acked-by: Waiman Long <[email protected]>
> >
> > BTW, I think we also need to update the documentation to make it clear
> > that a failed cmpxchg() or atomic_cmpxchg() may not be a full memory
> > barrier as most people may not be aware of that.
> >
I already have a small patch for the Documentation (see below -- Peter,
you can pick this up too if you like).
> I suspect that there may be other places in the kernel that have a similar
> problem. An alternative would be to strengthen cmpxchg() in architectures
> like arm64 to act as a full memory barrier regardless of the result, and
> to define a relaxed version with no such guarantee.
Could do, but I don't think the ordering requirements are needed in the
vast majority of cases (as in, this case was the only one we could find)
and it gets really muddy when you want to define the semantics of something
like a failed cmpxchg_release.
Will
--->8
commit b2fdae71bdb621f62450076af4126cd8bca22918
Author: Will Deacon <[email protected]>
Date: Mon Jul 13 12:27:30 2015 +0100
documentation: Clarify failed cmpxchg memory ordering semantics
A failed cmpxchg does not provide any memory ordering guarantees, a
property that is used to optimise the cmpxchg implementations on Alpha,
PowerPC and arm64.
This patch updates atomic_ops.txt and memory-barriers.txt to reflect
this.
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index dab6da3382d9..b19fc34efdb1 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -266,7 +266,9 @@ with the given old and new values. Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of *v are performed through atomic_xxx operations.
-atomic_cmpxchg must provide explicit memory barriers around the operation.
+atomic_cmpxchg must provide explicit memory barriers around the operation,
+although if the comparison fails then no memory ordering guarantees are
+required.
The semantics for atomic_cmpxchg are the same as those defined for 'cas'
below.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 13feb697271f..18fc860df1be 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2383,9 +2383,7 @@ about the state (old or new) implies an SMP-conditional general memory barrier
explicit lock operations, described later). These include:
xchg();
- cmpxchg();
atomic_xchg(); atomic_long_xchg();
- atomic_cmpxchg(); atomic_long_cmpxchg();
atomic_inc_return(); atomic_long_inc_return();
atomic_dec_return(); atomic_long_dec_return();
atomic_add_return(); atomic_long_add_return();
@@ -2398,7 +2396,9 @@ explicit lock operations, described later). These include:
test_and_clear_bit();
test_and_change_bit();
- /* when succeeds (returns 1) */
+ /* when succeeds */
+ cmpxchg();
+ atomic_cmpxchg(); atomic_long_cmpxchg();
atomic_add_unless(); atomic_long_add_unless();
These are used for such things as implementing ACQUIRE-class and RELEASE-class
Commit-ID: ed2de9f74ecbbf3063d29b2334e7b455d7f35189
Gitweb: http://git.kernel.org/tip/ed2de9f74ecbbf3063d29b2334e7b455d7f35189
Author: Will Deacon <[email protected]>
AuthorDate: Thu, 16 Jul 2015 16:10:06 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 10:57:09 +0200
locking/Documentation: Clarify failed cmpxchg() memory ordering semantics
A failed cmpxchg does not provide any memory ordering guarantees, a
property that is used to optimise the cmpxchg implementations on Alpha,
PowerPC and arm64.
This patch updates atomic_ops.txt and memory-barriers.txt to reflect
this.
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/atomic_ops.txt | 4 +++-
Documentation/memory-barriers.txt | 6 +++---
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index dab6da3..b19fc34 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -266,7 +266,9 @@ with the given old and new values. Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of *v are performed through atomic_xxx operations.
-atomic_cmpxchg must provide explicit memory barriers around the operation.
+atomic_cmpxchg must provide explicit memory barriers around the operation,
+although if the comparison fails then no memory ordering guarantees are
+required.
The semantics for atomic_cmpxchg are the same as those defined for 'cas'
below.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 13feb69..18fc860 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2383,9 +2383,7 @@ about the state (old or new) implies an SMP-conditional general memory barrier
explicit lock operations, described later). These include:
xchg();
- cmpxchg();
atomic_xchg(); atomic_long_xchg();
- atomic_cmpxchg(); atomic_long_cmpxchg();
atomic_inc_return(); atomic_long_inc_return();
atomic_dec_return(); atomic_long_dec_return();
atomic_add_return(); atomic_long_add_return();
@@ -2398,7 +2396,9 @@ explicit lock operations, described later). These include:
test_and_clear_bit();
test_and_change_bit();
- /* when succeeds (returns 1) */
+ /* when succeeds */
+ cmpxchg();
+ atomic_cmpxchg(); atomic_long_cmpxchg();
atomic_add_unless(); atomic_long_add_unless();
These are used for such things as implementing ACQUIRE-class and RELEASE-class
On Mon, 2015-08-03 at 09:59 -0700, tip-bot for Will Deacon wrote:
> Commit-ID: ed2de9f74ecbbf3063d29b2334e7b455d7f35189
> Gitweb: http://git.kernel.org/tip/ed2de9f74ecbbf3063d29b2334e7b455d7f35189
> Author: Will Deacon <[email protected]>
> AuthorDate: Thu, 16 Jul 2015 16:10:06 +0100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Mon, 3 Aug 2015 10:57:09 +0200
>
> locking/Documentation: Clarify failed cmpxchg() memory ordering semantics
>
> A failed cmpxchg does not provide any memory ordering guarantees, a
> property that is used to optimise the cmpxchg implementations on Alpha,
> PowerPC and arm64.
>
> This patch updates atomic_ops.txt and memory-barriers.txt to reflect
> this.
>
> Signed-off-by: Will Deacon <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Douglas Hatch <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Scott J Norton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Waiman Long <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> Documentation/atomic_ops.txt | 4 +++-
> Documentation/memory-barriers.txt | 6 +++---
> 2 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
> index dab6da3..b19fc34 100644
> --- a/Documentation/atomic_ops.txt
> +++ b/Documentation/atomic_ops.txt
> @@ -266,7 +266,9 @@ with the given old and new values. Like all atomic_xxx operations,
> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> other accesses of *v are performed through atomic_xxx operations.
>
> -atomic_cmpxchg must provide explicit memory barriers around the operation.
> +atomic_cmpxchg must provide explicit memory barriers around the operation,
> +although if the comparison fails then no memory ordering guarantees are
> +required.
Thanks, I also stumbled upon this with the wake-q stuff.