2015-08-08 03:18:26

by Waiman Long

Subject: [PATCH v5 0/6] locking/qspinlock: Enhance pvqspinlock performance

v4->v5:
- Rebased the patch to the latest tip tree.
- Corrected the comments and commit log for patch 1.
- Removed the v4 patch 5 as PV kick deferment is no longer needed with
the new tip tree.
- Simplified the adaptive spinning patch (patch 6) & improved its
performance a bit further.
- Re-ran the benchmark test with the new patch.

v3->v4:
- Patch 1: add a comment about a possible race condition in PV unlock.
- Patch 2: simplified the pv_pending_lock() function as suggested by
Davidlohr.
- Move PV unlock optimization patch forward to patch 4 & rerun
performance test.

v2->v3:
- Moved the deferred kicking enablement patch forward & moved back
the kick-ahead patch to make the effect of kick-ahead more visible.
- Reworked patch 6 to make it more readable.
- Reverted to using state as a tri-state variable instead of
adding an additional bi-state variable.
- Added performance data for different values of PV_KICK_AHEAD_MAX.
- Added a new patch to optimize the PV unlock code path performance.

v1->v2:
- Take out the queued unfair lock patches
- Add a patch to simplify the PV unlock code
- Move pending bit and statistics collection patches to the front
- Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
when appropriate.
- Change the wait-early patch to use adaptive spinning to better
balance the differing effects on normal and over-committed guests.
- Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both normal and
over-committed VM guests. The kick-ahead and adaptive spinning
patches are inspired by the "Do Virtual Machines Really Scale?" blog
from Sanidhya Kashyap.

Patch 1 simplifies the unlock code by removing the unnecessary
state check.

Patch 2 adds pending bit support to pvqspinlock, improving performance
at light load.

Patch 3 allows the collection of various counts that are useful for
seeing what is happening in the system. They do add a bit of overhead
when enabled, slowing performance a tiny bit.

Patch 4 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 5 enables multiple vCPU kick-aheads at unlock time, outside of
the critical section, which can improve performance in over-committed
guests and sometimes even in normal guests.

Patch 6 enables adaptive spinning in the queue nodes. This patch can
lead to a pretty big performance increase in over-committed guests at
the expense of a slight performance hit in normal guests.

Patches 2 & 4 improve the performance of the common uncontended and
lightly contended cases. Patches 5-6 are for improving performance in
over-committed VM guests.

Performance measurements were done on 32-CPU Westmere-EX and
Haswell-EX systems. The Westmere-EX system got the most performance
gain from patch 5, whereas the Haswell-EX system got the most gain
from patch 6 for over-committed guests.

The table below shows the Linux kernel build times for various
values of PV_KICK_AHEAD_MAX on an over-committed 48-vCPU guest on
the Westmere-EX system:

PV_KICK_AHEAD_MAX    Patches 1-5    Patches 1-6
-----------------    -----------    -----------
        1              9m46.9s       11m10.1s
        2              9m40.2s       10m08.3s
        3              9m36.8s        9m49.8s
        4              9m35.9s        9m38.7s
        5              9m35.1s        9m33.0s
        6              9m35.7s        9m28.5s

With patches 1-5, performance wasn't very sensitive to different
PV_KICK_AHEAD_MAX values. Adding patch 6 into the mix, however, changed
the picture quite dramatically. There is a performance regression if
PV_KICK_AHEAD_MAX is too small. From a value of 4 onward, increasing
PV_KICK_AHEAD_MAX yields only a minor benefit.

Waiman Long (6):
locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
locking/pvqspinlock: Add pending bit support
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock, x86: Optimize PV unlock code path
locking/pvqspinlock: Allow vCPUs kick-ahead
locking/pvqspinlock: Queue node adaptive spinning

arch/x86/Kconfig | 7 +
arch/x86/include/asm/qspinlock_paravirt.h | 59 ++++
kernel/locking/qspinlock.c | 32 ++-
kernel/locking/qspinlock_paravirt.h | 475 +++++++++++++++++++++++++++--
4 files changed, 542 insertions(+), 31 deletions(-)


2015-08-08 03:18:27

by Waiman Long

Subject: [PATCH v5 1/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
The extra check at the end of __pv_queued_spin_unlock() is unnecessary
and so is removed.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index c8e6e9a..6eafb9e 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -364,8 +364,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* vCPU is harmless other than the additional latency in completing
* the unlock.
*/
- if (READ_ONCE(node->state) == vcpu_hashed)
- pv_kick(node->cpu);
+ pv_kick(node->cpu);
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1

2015-08-08 03:19:10

by Waiman Long

Subject: [PATCH v5 2/6] locking/pvqspinlock: Add pending bit support

Like the native qspinlock, using the pending bit to acquire the lock
when it is lightly contended is faster than going through the PV
queuing process, which is even slower than the native queuing process.
It also avoids loading two additional cachelines (the MCS and PV
nodes).

This patch adds pending bit support to the PV qspinlock. The pending
bit code uses a smaller spin threshold (1<<10). It falls back to the
queuing method if it cannot acquire the lock within that time limit.

On a VM with 32 vCPUs on a 32-core Westmere-EX box, the kernel
build times on 4.2-rc1 based kernels were:

Kernel        Build Time   Sys Time
------        ----------   --------
w/o patch        3m28.5s   28m17.5s
with patch       3m19.3s   23m55.7s

Using a locking microbenchmark on the same system, the locking
rates (in kops/s) were:

Threads             Rate w/o patch   Rate with patch
-------             --------------   ---------------
2 (same socket)          6,515,265         7,077,476
2 (diff sockets)         2,967,145         4,353,851

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 27 ++++++++++++-
kernel/locking/qspinlock_paravirt.h | 73 +++++++++++++++++++++++++++++++++++
2 files changed, 99 insertions(+), 1 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 337c881..94fdd27 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -162,6 +162,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}

+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->pending, 0);
+}
+
/*
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
@@ -193,6 +204,15 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
}

/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL, &lock->val);
+}
+
+/**
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
* @tail : The new queue tail code word
@@ -245,6 +265,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

#define pv_enabled() false
+#define pv_pending_lock(l, v) false

#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
@@ -286,8 +307,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

- if (pv_enabled())
+ if (pv_enabled()) {
+ if (pv_pending_lock(lock, val))
+ return; /* Got the lock via pending bit */
goto queue;
+ }

if (virt_queued_spin_lock(lock))
return;
@@ -463,6 +487,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_wait_node
#undef pv_kick_node
#undef pv_wait_head
+#undef pv_pending_lock

#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6eafb9e..94f9adf 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,6 +23,14 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

/*
+ * Queued Spinlock Spin Threshold
+ *
+ * The vCPU will spin a relatively short time in pending mode before falling
+ * back to queuing.
+ */
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+
+/*
* Queue node uses: vcpu_running & vcpu_halted.
* Queue head uses: vcpu_running & vcpu_hashed.
*/
@@ -157,6 +165,71 @@ static void pv_init_node(struct mcs_spinlock *node)
}

/*
+ * Try to acquire the lock and wait using the pending bit within a certain
+ * threshold as specified by PENDING_SPIN_THRESHOLD. If the threshold has
+ * been exceeded without getting the lock, we fall back to queuing.
+ */
+static int pv_pending_lock(struct qspinlock *lock, u32 val)
+{
+ int loop = PENDING_SPIN_THRESHOLD;
+ u32 new, old;
+
+ /*
+ * wait for in-progress pending->locked hand-overs
+ */
+ while ((val == _Q_PENDING_VAL) && loop) {
+ cpu_relax();
+ val = atomic_read(&lock->val);
+ loop--;
+ }
+
+ /*
+ * trylock || pending
+ *
+ * This loop does a trylock if lock is free or sets the pending bit
+ * if lock is taken until the cmpxchg succeeds. As it is expected
+ * that we need no more than a few cmpxchg's to have a successful
+ * one, we don't check the loop count here.
+ */
+ for (;;) {
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+ val = old;
+ }
+
+ if (new == _Q_LOCKED_VAL)
+ goto gotlock; /* Trylock succeeded */
+ /*
+ * We are pending, wait for the owner to go away.
+ */
+ for (; loop; loop--, cpu_relax()) {
+ val = smp_load_acquire(&lock->val.counter);
+ if (!(val & _Q_LOCKED_MASK)) {
+ clear_pending_set_locked(lock);
+ goto gotlock;
+ }
+ }
+
+ /*
+ * Fail to acquire the lock within the threshold period with the
+ * pending bit set, so clear the pending bit and fall back to queuing.
+ */
+ clear_pending(lock);
+
+queue:
+ return 0;
+
+gotlock:
+ return 1;
+}
+
+/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
* behalf.
--
1.7.1

2015-08-08 03:19:01

by Waiman Long

Subject: [PATCH v5 3/6] locking/pvqspinlock: Collect slowpath lock statistics

This patch enables the accumulation of kicking- and waiting-related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of kicking and
wakeup latencies, which depend heavily on the CPUs being used.

The measured latencies for different CPUs are:

CPU             Wakeup   Kicking
---             ------   -------
Haswell-EX      89.8us     7.4us
Westmere-EX     67.6us     9.3us

The measured latencies varied a bit from run to run. The wakeup
latency is much higher than the kicking latency.

A sample of statistics counts after a kernel build (no CPU overcommit)
was:

hash_hops_count=576912
kick_latencies=5258025484
kick_unlock_count=576911
kick_wait_count=576903
pending_fail_count=10722
pending_lock_count=6123545
spurious_wakeup=92
wait_again_count=75
wait_head_count=60
wait_node_count=576936
wake_latencies=37061460652
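
For illustration only (not part of this patch), the short userspace
sketch below reads the counters created by this patch and derives the
average latencies, assuming QUEUED_LOCK_STAT is enabled and debugfs is
mounted at /sys/kernel/debug:

#include <stdio.h>

/* Read one counter from the pv-qspinlock debugfs directory */
static unsigned long long read_stat(const char *name)
{
	char path[128];
	unsigned long long val = 0;
	FILE *fp;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/pv-qspinlock/%s", name);
	fp = fopen(path, "r");
	if (!fp)
		return 0;
	if (fscanf(fp, "%llu", &val) != 1)
		val = 0;
	fclose(fp);
	return val;
}

int main(void)
{
	unsigned long long kicks = read_stat("kick_unlock_count");
	unsigned long long waits = read_stat("kick_wait_count");
	unsigned long long hops  = read_stat("hash_hops_count");
	unsigned long long klat  = read_stat("kick_latencies");
	unsigned long long wlat  = read_stat("wake_latencies");

	/* Averages as defined in the comment block of this patch */
	if (kicks)
		printf("avg kick latency = %llu ns, avg hash hops = %.2f\n",
		       klat / kicks, (double)hops / kicks);
	if (waits)
		printf("avg wakeup latency = %llu ns\n", wlat / waits);
	return 0;
}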

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 7 ++
kernel/locking/qspinlock_paravirt.h | 178 ++++++++++++++++++++++++++++++++++-
2 files changed, 180 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c9c8b8..86bf53e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -719,6 +719,13 @@ config PARAVIRT_SPINLOCKS

If you are unsure how to answer this question, answer Y.

+config QUEUED_LOCK_STAT
+ bool "Paravirt queued lock statistics"
+ depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ ---help---
+ Enable the collection of statistical data on the behavior of
+ paravirtualized queued spinlocks and report them on debugfs.
+
source "arch/x86/xen/Kconfig"

config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 94f9adf..5eb5dea 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -49,6 +49,151 @@ struct pv_node {
};

/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+ pvstat_wait_head,
+ pvstat_wait_node,
+ pvstat_wait_again,
+ pvstat_kick_wait,
+ pvstat_kick_unlock,
+ pvstat_pend_lock,
+ pvstat_pend_fail,
+ pvstat_spurious,
+ pvstat_hops,
+ pvstat_num /* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+ [pvstat_wait_head] = "wait_head_count",
+ [pvstat_wait_node] = "wait_node_count",
+ [pvstat_wait_again] = "wait_again_count",
+ [pvstat_kick_wait] = "kick_wait_count",
+ [pvstat_kick_unlock] = "kick_unlock_count",
+ [pvstat_pend_lock] = "pending_lock_count",
+ [pvstat_pend_fail] = "pending_fail_count",
+ [pvstat_spurious] = "spurious_wakeup",
+ [pvstat_hops] = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency = pv_kick_latencies/kick_unlock_count
+ * Avg wake latency = pv_wake_latencies/kick_wait_count
+ * Avg # of hops/hash = hash_hops_count/kick_unlock_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+ struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+ int i;
+
+ if (!d_pvqlock)
+ pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+ for (i = 0; i < pvstat_num; i++)
+ debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)&pvstats[i]);
+ debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_kick_latencies);
+ debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_wake_latencies);
+ debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+ int i;
+
+ for (i = 0; i < pvstat_num; i++)
+ atomic_set(&pvstats[i], 0);
+ atomic64_set(&pv_kick_latencies, 0);
+ atomic64_set(&pv_wake_latencies, 0);
+ reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+ atomic_inc(&pvstats[stat]);
+ if (unlikely(reset_cnts))
+ pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+ atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+ u64 start = sched_clock();
+
+ *per_cpu_ptr(&pv_kick_time, cpu) = start;
+ pv_kick(cpu);
+ atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+ u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+ *pkick_time = 0;
+ pv_wait(ptr, val);
+ if (*pkick_time) {
+ atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+ pvstat_inc(pvstat_kick_wait);
+ }
+}
+
+#define pv_kick(c) __pv_kick(c)
+#define pv_wait(p, v) __pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt) { }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
* Lock and MCS node addresses hash table for fast lookup
*
* Hashing is done on a per-cacheline basis to minimize the need to access
@@ -108,10 +253,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
+ int hopcnt = 0;

for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
+ pvstat_hop(hopcnt);
return &he->lock;
}
}
@@ -212,6 +360,7 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
val = smp_load_acquire(&lock->val.counter);
if (!(val & _Q_LOCKED_MASK)) {
clear_pending_set_locked(lock);
+ pvstat_inc(pvstat_pend_lock);
goto gotlock;
}
}
@@ -221,11 +370,13 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
* pending bit set, so clear the pending bit and fall back to queuing.
*/
clear_pending(lock);
+ pvstat_inc(pvstat_pend_fail);

queue:
return 0;

gotlock:
+ pvstat_inc(pvstat_pend_lock);
return 1;
}

@@ -237,9 +388,10 @@ gotlock:
static void pv_wait_node(struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ int waitcnt = 0;
int loop;

- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
@@ -257,15 +409,22 @@ static void pv_wait_node(struct mcs_spinlock *node)
*/
smp_store_mb(pn->state, vcpu_halted);

- if (!READ_ONCE(node->locked))
+ if (!READ_ONCE(node->locked)) {
+ pvstat_inc(pvstat_wait_node);
+ if (waitcnt)
+ pvstat_inc(pvstat_wait_again);
pv_wait(&pn->state, vcpu_halted);
+ }

/*
- * If pv_kick_node() changed us to vcpu_hashed, retain that value
- * so that pv_wait_head() knows to not also try to hash this lock.
+ * If pv_kick_node() changed us to vcpu_hashed, retain that
+ * value so that pv_wait_head() knows to not also try to hash
+ * this lock.
*/
cmpxchg(&pn->state, vcpu_halted, vcpu_running);

+ if (READ_ONCE(node->locked))
+ break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -273,6 +432,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
+ pvstat_inc(pvstat_spurious);
}

/*
@@ -323,6 +483,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
struct pv_node *pn = (struct pv_node *)node;
struct __qspinlock *l = (void *)lock;
struct qspinlock **lp = NULL;
+ int waitcnt = 0;
int loop;

/*
@@ -332,7 +493,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
if (READ_ONCE(pn->state) == vcpu_hashed)
lp = (struct qspinlock **)1;

- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
return;
@@ -366,14 +527,20 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pvstat_inc(pvstat_wait_head);
+ if (waitcnt)
+ pvstat_inc(pvstat_wait_again);
pv_wait(&l->locked, _Q_SLOW_VAL);

+ if (!READ_ONCE(l->locked))
+ return;
/*
* The unlocker should have freed the lock before kicking the
* CPU. So if the lock is still not free, it is a spurious
* wakeup and so the vCPU should wait again after spinning for
* a while.
*/
+ pvstat_inc(pvstat_spurious);
}

/*
@@ -437,6 +604,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* vCPU is harmless other than the additional latency in completing
* the unlock.
*/
+ pvstat_inc(pvstat_kick_unlock);
pv_kick(node->cpu);
}
/*
--
1.7.1

2015-08-08 03:19:04

by Waiman Long

Subject: [PATCH v5 4/6] locking/pvqspinlock, x86: Optimize PV unlock code path

The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:
1) Moving the unlock slowpath code from the fastpath into a separate
__pv_queued_spin_unlock_slowpath() function to make the fastpath
as simple as possible.
2) For x86-64, hand-coding an assembly function to combine the
register-saving thunk code with the fastpath code. Only registers
that are used in the fastpath will be saved and restored. If the
fastpath fails, the slowpath function will be called via another
PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
__pv_queued_spin_unlock() code as the thunk saves and restores
only one 32-bit register.

With a microbenchmark running a 5M lock/unlock loop, the table below
shows the execution times before and after the patch with different
numbers of threads in a VM running on a 32-core Westmere-EX box with
x86-64 4.2-rc1 based kernels:

Threads    Before patch    After patch    % Change
-------    ------------    -----------    --------
   1         134.1 ms        119.3 ms       -11%
   2          1286 ms          953 ms       -26%
   3          3715 ms         3480 ms       -6.3%
   4          4092 ms         3764 ms       -8.0%

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock_paravirt.h | 59 +++++++++++++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 43 +++++++++++++--------
2 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
#ifndef __ASM_QSPINLOCK_PARAVIRT_H
#define __ASM_QSPINLOCK_PARAVIRT_H

+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock __pv_queued_spin_unlock
+#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the register-saving thunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ * struct __qspinlock *l = (void *)lock;
+ * u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ * if (likely(lockval == _Q_LOCKED_VAL))
+ * return;
+ * pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ * rdi = lock (first argument)
+ * rsi = lockval (second argument)
+ * rdx = internal variable (set to 0)
+ */
+asm(".pushsection .text;"
+ ".globl " PV_UNLOCK ";"
+ ".align 4,0x90;"
+ PV_UNLOCK ": "
+ "push %rdx;"
+ "mov $0x1,%eax;"
+ "xor %edx,%edx;"
+ "lock cmpxchg %dl,(%rdi);"
+ "cmp $0x1,%al;"
+ "jne .slowpath;"
+ "pop %rdx;"
+ "ret;"
+ ".slowpath: "
+ "push %rsi;"
+ "movzbl %al,%esi;"
+ "call " PV_UNLOCK_SLOWPATH ";"
+ "pop %rsi;"
+ "pop %rdx;"
+ "ret;"
+ ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+ ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);

+#endif /* CONFIG_64BIT */
#endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 5eb5dea..7c9d6ed 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -551,23 +551,14 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}

/*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
*/
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
{
struct __qspinlock *l = (void *)lock;
struct pv_node *node;
- u8 locked;
-
- /*
- * We must not unlock if SLOW, because in that case we must first
- * unhash. Otherwise it would be possible to have multiple @lock
- * entries, which would be BAD.
- */
- locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
- if (likely(locked == _Q_LOCKED_VAL))
- return;

if (unlikely(locked != _Q_SLOW_VAL)) {
WARN(!debug_locks_silent,
@@ -607,12 +598,32 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
pvstat_inc(pvstat_kick_unlock);
pv_kick(node->cpu);
}
+
/*
* Include the architecture specific callee-save thunk of the
* __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
*/
#include <asm/qspinlock_paravirt.h>

+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+ u8 locked;
+
+ /*
+ * We must not unlock if SLOW, because in that case we must first
+ * unhash. Otherwise it would be possible to have multiple @lock
+ * entries, which would be BAD.
+ */
+ locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ if (likely(locked == _Q_LOCKED_VAL))
+ return;
+
+ __pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif /* __pv_queued_spin_unlock */
--
1.7.1

2015-08-08 03:19:06

by Waiman Long

Subject: [PATCH v5 5/6] locking/pvqspinlock: Allow vCPUs kick-ahead

Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
critical section and block forward progress. This patch implements
a kick-ahead mechanism where the unlocker will kick the queue head
vCPU as well as up to four additional vCPUs next to the queue head
if they were halted. The kicks are done after exiting the critical
section to improve parallelism.

The amount of kick-ahead allowed depends on the number of vCPUs
in the VM guest. Currently it allows up to 1 vCPU kick-ahead per
4 vCPUs available, up to a maximum of PV_KICK_AHEAD_MAX (4). There
are diminishing returns in increasing the maximum value. The current
value of 4 is a compromise that gets a nice performance boost without
penalizing the one vCPU that is doing all the kicking too much.
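
For illustration only (not part of this patch), a tiny standalone
sketch of this sizing rule, mirroring the pv_kick_ahead computation
added below:

#include <stdio.h>

#define PV_KICK_AHEAD_MAX	4

/* 1 kick-ahead per 4 available vCPUs, capped at PV_KICK_AHEAD_MAX */
static int kick_ahead(int nvcpus)
{
	int n = nvcpus / 4;

	return n < PV_KICK_AHEAD_MAX ? n : PV_KICK_AHEAD_MAX;
}

int main(void)
{
	static const int guests[] = { 4, 8, 16, 32, 48 };
	unsigned int i;

	for (i = 0; i < sizeof(guests) / sizeof(guests[0]); i++)
		printf("%2d vCPUs -> kick-ahead of %d\n",
		       guests[i], kick_ahead(guests[i]));
	return 0;
}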

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                     Westmere            Haswell
Patch           32 vCPUs  48 vCPUs  32 vCPUs  48 vCPUs
-----           --------  --------  --------  --------
Before patch     3m21.9s  11m20.6s   2m08.6s  17m12.8s
After patch      3m03.2s   9m21.1s   2m08.9s  16m14.8s

This improves performance quite substantially on Westmere, but not
so much on Haswell.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 71 +++++++++++++++++++++++++++++++++-
1 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 7c9d6ed..9996609 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -57,6 +57,7 @@ enum pv_qlock_stat {
pvstat_wait_again,
pvstat_kick_wait,
pvstat_kick_unlock,
+ pvstat_kick_ahead,
pvstat_pend_lock,
pvstat_pend_fail,
pvstat_spurious,
@@ -77,6 +78,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_again] = "wait_again_count",
[pvstat_kick_wait] = "kick_wait_count",
[pvstat_kick_unlock] = "kick_unlock_count",
+ [pvstat_kick_ahead] = "kick_ahead_count",
[pvstat_pend_lock] = "pending_lock_count",
[pvstat_pend_fail] = "pending_fail_count",
[pvstat_spurious] = "spurious_wakeup",
@@ -89,7 +91,7 @@ static atomic_t pvstats[pvstat_num];
* pv_kick_latencies = sum of all pv_kick latencies in ns
* pv_wake_latencies = sum of all wakeup latencies in ns
*
- * Avg kick latency = pv_kick_latencies/kick_unlock_count
+ * Avg kick latency = pv_kick_latencies/(kick_unlock_count + kick_ahead_count)
* Avg wake latency = pv_wake_latencies/kick_wait_count
* Avg # of hops/hash = hash_hops_count/kick_unlock_count
*/
@@ -221,6 +223,18 @@ static struct pv_hash_entry *pv_lock_hash;
static unsigned int pv_lock_hash_bits __read_mostly;

/*
+ * Allow kick-ahead of vCPUs at unlock time
+ *
+ * The pv_kick_ahead value is set by a simple formula that 1 vCPU kick-ahead
+ * is allowed per 4 vCPUs available up to a maximum of PV_KICK_AHEAD_MAX.
+ * There are diminishing returns in increasing PV_KICK_AHEAD_MAX. The current
+ * value of 4 is a good compromise that gives a good performance boost without
+ * penalizing the vCPU that is doing the kicking by too much.
+ */
+#define PV_KICK_AHEAD_MAX 4
+static int pv_kick_ahead __read_mostly;
+
+/*
* Allocate memory for the PV qspinlock hash buckets
*
* This function should be called from the paravirt spinlock initialization
@@ -228,7 +242,8 @@ static unsigned int pv_lock_hash_bits __read_mostly;
*/
void __init __pv_init_lock_hash(void)
{
- int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
+ int ncpus = num_possible_cpus();
+ int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);

if (pv_hash_size < PV_HE_MIN)
pv_hash_size = PV_HE_MIN;
@@ -242,6 +257,13 @@ void __init __pv_init_lock_hash(void)
pv_hash_size, 0, HASH_EARLY,
&pv_lock_hash_bits, NULL,
pv_hash_size, pv_hash_size);
+ /*
+ * Enable the unlock kick ahead mode according to the number of
+ * vCPUs available.
+ */
+ pv_kick_ahead = min(ncpus/4, PV_KICK_AHEAD_MAX);
+ if (pv_kick_ahead)
+ pr_info("PV unlock kick ahead max count = %d\n", pv_kick_ahead);
}

#define for_each_hash_entry(he, offset, hash) \
@@ -551,6 +573,26 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}

/*
+ * Helper to get the address of the next kickable node
+ *
+ * The node has to be in the halted state. The state will then be
+ * transitioned to the running state. If no kickable node is found, NULL
+ * will be returned.
+ */
+static inline struct pv_node *pv_get_kick_node(struct pv_node *node)
+{
+ struct pv_node *next = (struct pv_node *)READ_ONCE(node->mcs.next);
+
+ if (!next || (READ_ONCE(next->state) != vcpu_halted))
+ return NULL;
+
+ if (xchg(&next->state, vcpu_running) != vcpu_halted)
+ next = NULL; /* No kicking is needed */
+
+ return next;
+}
+
+/*
* PV versions of the unlock fastpath and slowpath functions to be used
* instead of queued_spin_unlock().
*/
@@ -558,7 +600,8 @@ __visible void
__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
{
struct __qspinlock *l = (void *)lock;
- struct pv_node *node;
+ struct pv_node *node, *next;
+ int i, nr_kick, cpus[PV_KICK_AHEAD_MAX];

if (unlikely(locked != _Q_SLOW_VAL)) {
WARN(!debug_locks_silent,
@@ -583,6 +626,20 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
node = pv_unhash(lock);

/*
+ * Implement unlock kick-ahead
+ *
+ * Access the next group of nodes, if available, and prepare to kick
+ * them after releasing the lock if they are in the halted state. This
+ * should improve performance on an overcommitted system.
+ */
+ for (nr_kick = 0, next = node; nr_kick < pv_kick_ahead; nr_kick++) {
+ next = pv_get_kick_node(next);
+ if (!next)
+ break;
+ cpus[nr_kick] = next->cpu;
+ }
+
+ /*
* Now that we have a reference to the (likely) blocked pv_node,
* release the lock.
*/
@@ -597,6 +654,14 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
*/
pvstat_inc(pvstat_kick_unlock);
pv_kick(node->cpu);
+
+ /*
+ * Kick the next group of vCPUs, if available.
+ */
+ for (i = 0; i < nr_kick; i++) {
+ pvstat_inc(pvstat_kick_ahead);
+ pv_kick(cpus[i]);
+ }
}

/*
--
1.7.1

2015-08-08 03:19:08

by Waiman Long

Subject: [PATCH v5 6/6] locking/pvqspinlock: Queue node adaptive spinning

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue have already been halted. The spinning in the later vCPUs is
then just a waste of precious CPU cycles because they are not going
to get the lock soon, as the earlier ones have to be woken up and take
their turn first.

Reducing the spinning threshold is found to improve performance in
an overcommitted VM guest, but to decrease performance when there is
no overcommitment.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() earlier if all of the following conditions are true:

1) the vCPU has not been halted before;
2) the previous vCPU is in the halted state;
3) the current vCPU has recently done a lot of pv_wait() calls.
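
Condition 3 is tracked by a per-node wait_hist counter introduced
below. For illustration only (not part of this patch), the standalone
sketch below uses the patch's constants to show that roughly ten
consecutive slowpath calls, each ending in a pv_wait(), are needed to
cross the wait-early threshold:

#include <stdio.h>

/* Constants taken from the wait_hist heuristic in this patch */
#define PV_WAITHIST_INC		4
#define PV_WAITHIST_MIN		1
#define PV_WAITHIST_MAX		40
#define PV_WAITHIST_THRESHOLD	30

int main(void)
{
	int wait_hist = PV_WAITHIST_MIN;
	int calls = 0;

	/*
	 * Simplified model: each slowpath call decrements wait_hist by 1
	 * and then does one non-wait-early pv_wait(), which adds
	 * PV_WAITHIST_INC.
	 */
	while (wait_hist <= PV_WAITHIST_THRESHOLD) {
		if (wait_hist > PV_WAITHIST_MIN)
			wait_hist--;
		if (wait_hist < PV_WAITHIST_MAX)
			wait_hist += PV_WAITHIST_INC;
		calls++;
		printf("call %2d: wait_hist = %d\n", calls, wait_hist);
	}
	printf("wait-early mode enabled after %d slowpath calls\n", calls);
	return 0;
}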

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                     Westmere            Haswell
Patch           32 vCPUs  48 vCPUs  32 vCPUs  48 vCPUs
-----           --------  --------  --------  --------
Before patch     3m03.2s   9m21.1s   2m08.9s  16m14.8s
After patch      3m04.1s   9m28.5s   2m09.5s   8m29.3s

This patch seemed to cause a tiny bit of performance degradation
for 32 vCPUs. For 48 vCPUs, there wasn't much change for Westmere,
but there was a pretty big performance jump for Haswell.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 5 +-
kernel/locking/qspinlock_paravirt.h | 111 +++++++++++++++++++++++++++++++++-
2 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 94fdd27..da39d43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -258,7 +258,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
*/

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
static __always_inline void __pv_kick_node(struct qspinlock *lock,
struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
@@ -415,7 +416,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);

- pv_wait_node(node);
+ pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
}

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 9996609..f03bd7a 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -31,6 +31,38 @@
#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)

/*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will spin less if the following conditions are all true:
+ * 1) vCPU in the previous node is halted
+ * 2) it has not been halted before
+ * 3) there is a lot of pv_wait() in the current vCPU recently
+ *
+ * The last condition is being monitored by the wait_hist field in the pv_node
+ * structure which tracks the history of pv_wait() relative to slowpath calls.
+ * Each pv_wait will increment this field by PV_WAITHIST_INC until it exceeds
+ * PV_WAITHIST_MAX. Each slowpath lock call will decrement it by 1 until it
+ * reaches PV_WAITHIST_MIN. If its value is higher than PV_WAITHIST_THRESHOLD,
+ * the vCPU will spin less. The reason for this adaptive spinning is to try
+ * to enable wait-early mode only on over-committed guest which helps
+ * performance. However, it shouldn't be enabled when the guest is not
+ * over-committed as it will hurt performance.
+ *
+ * With PV_WAITHIST_INC set to 4, each pv_wait() while not in wait-early mode
+ * will increase wait_hist by a net of 3. Each slowpath call without pv_wait() will
+ * decrement wait_hist by 1. The threshold is set at about 3/4 of the range
+ * so that about 10 steps from the edges in either direction will reach the
+ * threshold. If, on average, more than 1/4 of all slowpath calls results in
+ * a pv_wait(), it should stay in the wait-early mode.
+ */
+#define PV_WAITHIST_MASK 0xff
+#define PV_WAITHIST_INC 4
+#define PV_WAITHIST_MIN 1
+#define PV_WAITHIST_MAX 40
+#define PV_WAITHIST_THRESHOLD 30
+#define PV_CAN_WAIT_EARLY(w) ((w)->wait_hist > PV_WAITHIST_THRESHOLD)
+
+/*
* Queue node uses: vcpu_running & vcpu_halted.
* Queue head uses: vcpu_running & vcpu_hashed.
*/
@@ -46,6 +78,8 @@ struct pv_node {

int cpu;
u8 state;
+ u8 wait_hist;
+ u8 wait_early;
};

/*
@@ -55,6 +89,7 @@ enum pv_qlock_stat {
pvstat_wait_head,
pvstat_wait_node,
pvstat_wait_again,
+ pvstat_wait_early,
pvstat_kick_wait,
pvstat_kick_unlock,
pvstat_kick_ahead,
@@ -76,6 +111,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_head] = "wait_head_count",
[pvstat_wait_node] = "wait_node_count",
[pvstat_wait_again] = "wait_again_count",
+ [pvstat_wait_early] = "wait_early_count",
[pvstat_kick_wait] = "kick_wait_count",
[pvstat_kick_unlock] = "kick_unlock_count",
[pvstat_kick_ahead] = "kick_ahead_count",
@@ -322,6 +358,63 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
}

/*
+ * Helper functions to maintain the wait_hist field.
+ */
+static inline void pv_inc_waithist(struct pv_node *node)
+{
+ /*
+ * pv_wait() in wait_early mode doesn't count as much as !wait_early
+ */
+ if (node->wait_hist < PV_WAITHIST_MAX)
+ node->wait_hist += node->wait_early ? 1 : PV_WAITHIST_INC;
+}
+
+static inline void pv_dec_waithist(struct pv_node *node)
+{
+ node->wait_early = 0;
+ if (node->wait_hist > PV_WAITHIST_MIN)
+ node->wait_hist--;
+}
+
+/*
+ * Return true if the current queue node vCPU should call pv_wait() now.
+ *
+ * The following conditions have to be true for this function to return true:
+ * 1) vCPU in the previous node is halted
+ * 2) it has not been halted before (waitcnt == 0)
+ * 3) it is time to perform the check ((loop & PV_WAITHIST_MASK) == 0)
+ * 4) PV_CAN_WAIT_EARLY() is true.
+ * 5) for the vCPU next to the queue head, it has to spin at least 1/2 of
+ * SPIN_THRESHOLD before waiting.
+ *
+ * In essence, this function causes the queue node vCPU to halt itself whenever
+ * the previous one has been halted. However, if it has been halted and kicked
+ * before, it is assumed to be near the queue head and is going to get the lock
+ * soon. So it won't go into the wait-early mode.
+ */
+static inline bool pv_wait_early(struct pv_node *node, struct pv_node *prev,
+ int waitcnt, int loop)
+{
+ bool wait_early;
+
+ if (waitcnt || ((loop & PV_WAITHIST_MASK) != 0))
+ return false;
+
+ wait_early = PV_CAN_WAIT_EARLY(node) &&
+ (READ_ONCE(prev->state) != vcpu_running);
+ if (wait_early) {
+ if ((SPIN_THRESHOLD - loop < SPIN_THRESHOLD/2 ) &&
+ READ_ONCE(prev->mcs.locked))
+ return false;
+
+ pvstat_inc(pvstat_wait_early);
+ node->wait_early = 1;
+ }
+
+ return wait_early;
+}
+
+/*
* Initialize the PV part of the mcs_spinlock node.
*/
static void pv_init_node(struct mcs_spinlock *node)
@@ -332,6 +425,7 @@ static void pv_init_node(struct mcs_spinlock *node)

pn->cpu = smp_processor_id();
pn->state = vcpu_running;
+ pn->wait_early = 0;
}

/*
@@ -407,9 +501,10 @@ gotlock:
* pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
* behalf.
*/
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct pv_node *pp = (struct pv_node *)prev;
int waitcnt = 0;
int loop;

@@ -417,6 +512,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
+ if (pv_wait_early(pn, pp, waitcnt, loop))
+ break;
cpu_relax();
}

@@ -432,6 +529,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
smp_store_mb(pn->state, vcpu_halted);

if (!READ_ONCE(node->locked)) {
+ pv_inc_waithist(pn);
pvstat_inc(pvstat_wait_node);
if (waitcnt)
pvstat_inc(pvstat_wait_again);
@@ -449,12 +547,15 @@ static void pv_wait_node(struct mcs_spinlock *node)
break;
/*
* If the locked flag is still not set after wakeup, it is a
- * spurious wakeup and the vCPU should wait again. However,
- * there is a pretty high overhead for CPU halting and kicking.
+ * spurious wakeup unless it is in wait-early mode. As there
+ * is a pretty high overhead for CPU halting and kicking.
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
- pvstat_inc(pvstat_spurious);
+ if (!pn->wait_early)
+ pvstat_inc(pvstat_spurious);
+ else
+ pn->wait_early = 0; /* Reset wait-early mode */
}

/*
@@ -515,6 +616,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
if (READ_ONCE(pn->state) == vcpu_hashed)
lp = (struct qspinlock **)1;

+ pv_dec_waithist(pn); /* Pre-decrement the wait_hist field */
for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -549,6 +651,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pv_inc_waithist(pn);
pvstat_inc(pvstat_wait_head);
if (waitcnt)
pvstat_inc(pvstat_wait_again);
--
1.7.1

2015-08-08 06:02:57

by Peter Zijlstra

Subject: Re: [PATCH v5 1/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Fri, Aug 07, 2015 at 11:17:56PM -0400, Waiman Long wrote:
> If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
> The extra check at the end of __pv_queued_spin_unlock() is unnecessary
> and so is removed.

This is half the patch it should be.

Because if the load is not needed, then the store is not either, and
then there's the comments to update.

2015-08-14 02:07:24

by Waiman Long

Subject: Re: [PATCH v5 1/6] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On 08/08/2015 02:02 AM, Peter Zijlstra wrote:
> On Fri, Aug 07, 2015 at 11:17:56PM -0400, Waiman Long wrote:
>> If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
>> The extra check at the end of __pv_queued_spin_unlock() is unnecessary
>> and so is removed.
> This is half the patch it should be.
>
> Because if the load is not needed, then the store is not either, and
> then there's the comments to update.

Sorry for the late reply.


That is true. Setting the state to vcpu_hashed in pv_wait_head() isn't
really necessary. I kept it there for consistency's sake as the state
may be set to vcpu_hashed in pv_kick_node(). We can certainly take that
out.

BTW, could you also review the other patches when you have time? I am
coming to LinuxCon/Plumbers next week. Hopefully, I can chat with
you again.

Cheers,
Longman