2015-08-01 02:22:24

by Waiman Long

Subject: [PATCH v4 0/7] locking/qspinlock: Enhance pvqspinlock performance

v3->v4:
- Patch 1: add comment about possible racing condition in PV unlock.
- Patch 2: simplified the pv_pending_lock() function as suggested by
Davidlohr.
- Move PV unlock optimization patch forward to patch 4 & rerun
performance test.

v2->v3:
- Moved the deferred kicking enablement patch forward & moved back
the kick-ahead patch to make the effect of kick-ahead more visible.
- Reworked patch 6 to make it more readable.
- Reverted back to using state as a tri-state variable instead of
adding an additional two-state variable.
- Added performance data for different values of PV_KICK_AHEAD_MAX.
- Add a new patch to optimize PV unlock code path performance.

v1->v2:
- Take out the queued unfair lock patches
- Add a patch to simplify the PV unlock code
- Move pending bit and statistics collection patches to the front
- Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
when appropriate.
- Change the wait-early patch to use adaptive spinning to better
balance its differing effects on normal and over-committed guests.
- Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both normal and
over-committed VM guests. The kick-ahead and adaptive spinning
patches are inspired by the "Do Virtual Machines Really Scale?" blog
from Sanidhya Kashyap.

Patch 1 simplifies the unlock code by doing an unconditional vCPU kick
when _Q_SLOW_VAL is set, as the chance of a spurious wakeup showing
up in the statistical data I collected was very low (1 or 2
occasionally).

Patch 2 adds pending bit support to pvqspinlock, improving performance
at light load.

Patch 3 allows the collection of various counts that are useful for
seeing what is happening in the system. They do add a bit of overhead
when enabled, slowing performance slightly.

Patch 4 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 5 is an enablement patch for deferring vCPU kicks from the
lock side to the unlock side.

Patch 6 enables multiple vCPU kick-aheads at unlock time, outside of
the critical section, which can improve performance in over-committed
guests and sometimes even in normal guests.

Patch 7 enables adaptive spinning in the queue nodes. This patch can
lead to a pretty big performance increase in over-committed guests at
the expense of a slight performance hit in normal guests.

Patches 2 & 4 improve performance in the common uncontended and
lightly contended cases. Patches 5-7 are for improving performance in
over-committed VM guests.

Performance measurements were done on 32-CPU Westmere-EX and
Haswell-EX systems. The Westmere-EX system got the most performance
gain from patch 5, whereas the Haswell-EX system got the most gain
from patch 6 for over-committed guests.

The table below shows the Linux kernel build times for various
values of PV_KICK_AHEAD_MAX on an over-committed 48-vCPU guest on
the Westmere-EX system:

PV_KICK_AHEAD_MAX   Patches 1-5   Patches 1-6
-----------------   -----------   -----------
        1             9m46.9s      11m10.1s
        2             9m40.2s      10m08.3s
        3             9m36.8s       9m49.8s
        4             9m35.9s       9m38.7s
        5             9m35.1s       9m33.0s
        6             9m35.7s       9m28.5s

With patches 1-5, the performance wasn't very sensitive to different
PV_KICK_AHEAD_MAX values. Adding patch 6 into the mix, however, changes
the picture quite dramatically. There is a performance regression if
PV_KICK_AHEAD_MAX is too small. Starting with a value of 4, increasing
PV_KICK_AHEAD_MAX only gets us a minor benefit.

Waiman Long (7):
locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
locking/pvqspinlock: Add pending bit support
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock, x86: Optimize PV unlock code path
locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call
locking/pvqspinlock: Allow vCPUs kick-ahead
locking/pvqspinlock: Queue node adaptive spinning

arch/x86/Kconfig | 7 +
arch/x86/include/asm/qspinlock_paravirt.h | 59 +++
kernel/locking/qspinlock.c | 38 ++-
kernel/locking/qspinlock_paravirt.h | 546 +++++++++++++++++++++++++++--
4 files changed, 612 insertions(+), 38 deletions(-)

v3-to-v4 diff:

arch/x86/include/asm/qspinlock_paravirt.h | 5 +--
kernel/locking/qspinlock_paravirt.h | 45 +++++++++++++++++-----------
2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 46f0f82..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -9,10 +9,10 @@
*/
#ifdef CONFIG_64BIT

-PV_CALLEE_SAVE_REGS_THUNK(pv_queued_spin_unlock_slowpath);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
#define __pv_queued_spin_unlock __pv_queued_spin_unlock
#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
-#define PV_UNLOCK_SLOWPATH "__raw_callee_save_pv_queued_spin_unlock_slowpath"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"

/*
* Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
@@ -46,7 +46,6 @@ asm(".pushsection .text;"
"jne .slowpath;"
"pop %rdx;"
"ret;"
- "nop;"
".slowpath: "
"push %rsi;"
"movzbl %al,%esi;"
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 56c3717..d04911b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -463,16 +463,16 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
/*
* wait for in-progress pending->locked hand-overs
*/
- if (val == _Q_PENDING_VAL) {
- while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
- loop--)
- cpu_relax();
+ while ((val == _Q_PENDING_VAL) && loop) {
+ cpu_relax();
+ val = atomic_read(&lock->val);
+ loop--;
}

/*
* trylock || pending
*/
- for (;;) {
+ for (;; loop--) {
if (val & ~_Q_LOCKED_MASK)
goto queue;
new = _Q_LOCKED_VAL;
@@ -481,7 +481,7 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
old = atomic_cmpxchg(&lock->val, val, new);
if (old == val)
break;
- if (loop-- <= 0)
+ if (!loop)
goto queue;
}

@@ -490,16 +490,18 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
/*
* We are pending, wait for the owner to go away.
*/
- while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
- && (loop-- > 0))
- cpu_relax();
-
- if (!(val & _Q_LOCKED_MASK)) {
- clear_pending_set_locked(lock);
- goto gotlock;
+ for (; loop; loop--, cpu_relax()) {
+ val = smp_load_acquire(&lock->val.counter);
+ if (!(val & _Q_LOCKED_MASK)) {
+ clear_pending_set_locked(lock);
+ pvstat_inc(pvstat_pend_lock);
+ goto gotlock;
+ }
}
+
/*
- * Clear the pending bit and fall back to queuing
+ * Fail to acquire the lock within the spinning period,
+ * so clear the pending bit and fall back to queuing.
*/
clear_pending(lock);
pvstat_inc(pvstat_pend_fail);
@@ -719,11 +721,11 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}

/*
- * PV version of the unlock fastpath and slowpath functions to be used
- * in stead of queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
*/
__visible void
-pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
{
struct __qspinlock *l = (void *)lock;
struct pv_node *node, *next;
@@ -771,6 +773,13 @@ pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
/*
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
+ *
+ * As smp_store_release() is not a full barrier, adding a check to
+ * the node->state doesn't guarantee the checking is really done
+ * after clearing the lock byte since they are in 2 separate
+ * cachelines and so hardware can reorder them. So either we insert
+ * memory barrier here and in the corresponding pv_wait_head()
+ * function or we do an unconditional kick which is what is done here.
*/
pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
@@ -803,6 +812,6 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
if (likely(lockval == _Q_LOCKED_VAL))
return;

- pv_queued_spin_unlock_slowpath(lock, lockval);
+ __pv_queued_spin_unlock_slowpath(lock, lockval);
}
#endif /* __pv_queued_spin_unlock */


2015-08-01 02:24:13

by Waiman Long

Subject: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

The smp_store_release() is not a full barrier. In order to avoid a
missed wakeup, we may need to add memory barriers around the locked
and cpu state variables, adding complexity. As the chance of a spurious
wakeup is very low, it is easier and safer to just do an unconditional
kick at unlock time.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 15d3733..2dd4b39 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -240,7 +240,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}

- WRITE_ONCE(pn->state, vcpu_halted);
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
@@ -320,9 +319,15 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
/*
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
+ *
+ * As smp_store_release() is not a full barrier, adding a check to
+ * the node->state doesn't guarantee the checking is really done
+ * after clearing the lock byte since they are in 2 separate
+ * cachelines and so hardware can reorder them. So either we insert
+ * memory barrier here and in the corresponding pv_wait_head()
+ * function or we do an unconditional kick which is what is done here.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
- pv_kick(node->cpu);
+ pv_kick(node->cpu);
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1

2015-08-01 02:22:52

by Waiman Long

Subject: [PATCH v4 2/7] locking/pvqspinlock: Add pending bit support

Like the native qspinlock, using the pending bit to acquire the lock
when it is lightly contended is faster than going through the PV
queuing process, which is even slower than the native queuing process.
It also avoids loading two additional cachelines (the MCS and PV nodes).

This patch adds pending bit support for the PV qspinlock. The pending
bit code has a smaller spin threshold (1<<10). It falls back to the
queuing method if it cannot acquire the lock within a certain time
limit.
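
For reference, a minimal sketch of the lock word this fast path operates
on (assuming the usual qspinlock layout for NR_CPUS < 16K) and of the
pending-path flow implemented below:

    /*
     *  31        16 15      8 7        0
     * +------------+---------+----------+
     * |    tail    | pending |  locked  |
     * +------------+---------+----------+
     *
     * pv_pending_lock() in brief:
     *  1. Briefly wait out any in-progress pending->locked hand-over.
     *  2. Trylock, or set the pending bit if the lock is held.
     *  3. Spin up to PENDING_SPIN_THRESHOLD for the locked byte to
     *     clear, then clear_pending_set_locked().
     *  4. On timeout, clear the pending bit and fall back to queuing.
     */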

On a VM with 32 vCPUs on a 32-core Westmere-EX box, the kernel
build times on 4.2-rc1 based kernels were:

Kernel       Build Time   Sys Time
------       ----------   --------
w/o patch     3m28.5s     28m17.5s
with patch    3m19.3s     23m55.7s

Using a locking microbenchmark on the same system, the locking
rates (in kops/s) were:

Threads            Rate w/o patch   Rate with patch
-------            --------------   ---------------
2 (same socket)       6,515,265        7,077,476
2 (diff sockets)      2,967,145        4,353,851

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 27 +++++++++++++-
kernel/locking/qspinlock_paravirt.h | 67 +++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+), 1 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..6518ee9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -162,6 +162,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}

+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->pending, 0);
+}
+
/*
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
@@ -193,6 +204,15 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
}

/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL, &lock->val);
+}
+
+/**
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
* @tail : The new queue tail code word
@@ -245,6 +265,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

#define pv_enabled() false
+#define pv_pending_lock(l, v) false

#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
@@ -286,8 +307,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

- if (pv_enabled())
+ if (pv_enabled()) {
+ if (pv_pending_lock(lock, val))
+ return; /* Got the lock via pending bit */
goto queue;
+ }

if (virt_queued_spin_lock(lock))
return;
@@ -463,6 +487,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_wait_node
#undef pv_kick_node
#undef pv_wait_head
+#undef pv_pending_lock

#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2dd4b39..5325877 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -22,6 +22,14 @@

#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

+/*
+ * Queued Spinlock Spin Threshold
+ *
+ * The vCPU will spin a relatively short time in pending mode before falling
+ * back to queuing.
+ */
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+
enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
@@ -152,6 +160,65 @@ static void pv_init_node(struct mcs_spinlock *node)
}

/*
+ * Try to acquire the lock and wait using the pending bit
+ */
+static int pv_pending_lock(struct qspinlock *lock, u32 val)
+{
+ int loop = PENDING_SPIN_THRESHOLD;
+ u32 new, old;
+
+ /*
+ * wait for in-progress pending->locked hand-overs
+ */
+ while ((val == _Q_PENDING_VAL) && loop) {
+ cpu_relax();
+ val = atomic_read(&lock->val);
+ loop--;
+ }
+
+ /*
+ * trylock || pending
+ */
+ for (;; loop--) {
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+ if (!loop)
+ goto queue;
+ }
+
+ if (new == _Q_LOCKED_VAL)
+ goto gotlock;
+ /*
+ * We are pending, wait for the owner to go away.
+ */
+ for (; loop; loop--, cpu_relax()) {
+ val = smp_load_acquire(&lock->val.counter);
+ if (!(val & _Q_LOCKED_MASK)) {
+ clear_pending_set_locked(lock);
+ goto gotlock;
+ }
+ }
+
+ /*
+ * Fail to acquire the lock within the spinning period,
+ * so clear the pending bit and fall back to queuing.
+ */
+ clear_pending(lock);
+
+queue:
+ return 0;
+
+gotlock:
+ return 1;
+}
+
+/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_kick_node() is used to wake the vcpu again.
*/
--
1.7.1

2015-08-01 02:22:33

by Waiman Long

Subject: [PATCH v4 3/7] locking/pvqspinlock: Collect slowpath lock statistics

This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of kicking and
wakeup latencies, which depend heavily on the CPUs being used.

The measured latencies for different CPUs are:

CPU           Wakeup   Kicking
---           ------   -------
Haswell-EX    89.8us    7.4us
Westmere-EX   67.6us    9.3us

The measured latencies varied a bit from run-to-run. The wakeup
latency is much higher than the kicking latency.

A sample of statistics counts after a kernel build (no CPU overcommit)
was (derived averages follow the list):

hash_hops_count=43
kick_latencies=5783565492
kick_time_count=640269
lock_kick_count=640238
pending_fail_count=10672
pending_lock_count=2946871
spurious_wakeup=14
unlock_kick_count=38
wait_again_count=4
wait_head_count=41
wait_node_count=640242
wake_latencies=42491684295
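
Plugging the sample counts into the averaging formulas documented in
the patch gives figures consistent with the latency table above,
allowing for the noted run-to-run variation:

    Avg kick latency  = 5783565492 ns / (640238 + 38) ~= 9.0 us
    Avg wake latency  = 42491684295 ns / 640269       ~= 66.4 us
    Avg hops per hash = 43 / 38                       ~= 1.1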

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 7 ++
kernel/locking/qspinlock_paravirt.h | 179 ++++++++++++++++++++++++++++++++++-
2 files changed, 182 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d..e89080b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -685,6 +685,13 @@ config PARAVIRT_SPINLOCKS

If you are unsure how to answer this question, answer Y.

+config QUEUED_LOCK_STAT
+ bool "Paravirt queued lock statistics"
+ depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ ---help---
+ Enable the collection of statistical data on the behavior of
+ paravirtualized queued spinlocks and report them on debugfs.
+
source "arch/x86/xen/Kconfig"

config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 5325877..3552aa9 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -44,6 +44,153 @@ struct pv_node {
};

/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+ pvstat_wait_head,
+ pvstat_wait_node,
+ pvstat_wait_again,
+ pvstat_kick_time,
+ pvstat_lock_kick,
+ pvstat_unlock_kick,
+ pvstat_pend_lock,
+ pvstat_pend_fail,
+ pvstat_spurious,
+ pvstat_hops,
+ pvstat_num /* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statiatics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+ [pvstat_wait_head] = "wait_head_count",
+ [pvstat_wait_node] = "wait_node_count",
+ [pvstat_wait_again] = "wait_again_count",
+ [pvstat_kick_time] = "kick_time_count",
+ [pvstat_lock_kick] = "lock_kick_count",
+ [pvstat_unlock_kick] = "unlock_kick_count",
+ [pvstat_pend_lock] = "pending_lock_count",
+ [pvstat_pend_fail] = "pending_fail_count",
+ [pvstat_spurious] = "spurious_wakeup",
+ [pvstat_hops] = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency = pv_kick_latencies/(lock_kick_count + unlock_kick_count)
+ * Avg wake latency = pv_wake_latencies/kick_time_count
+ * Avg # of hops/hash = hash_hops_count/unlock_kick_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+ struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+ int i;
+
+ if (!d_pvqlock)
+ pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+ for (i = 0; i < pvstat_num; i++)
+ debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)&pvstats[i]);
+ debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_kick_latencies);
+ debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_wake_latencies);
+ debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+ int i;
+
+ for (i = 0; i < pvstat_num; i++)
+ atomic_set(&pvstats[i], 0);
+ atomic64_set(&pv_kick_latencies, 0);
+ atomic64_set(&pv_wake_latencies, 0);
+ reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+ atomic_inc(&pvstats[stat]);
+ if (unlikely(reset_cnts))
+ pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+ atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+ u64 start = sched_clock();
+
+ *per_cpu_ptr(&pv_kick_time, cpu) = start;
+ pv_kick(cpu);
+ atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+ u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+ *pkick_time = 0;
+ pv_wait(ptr, val);
+ if (*pkick_time) {
+ atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+ pvstat_inc(pvstat_kick_time);
+ }
+}
+
+#define pv_kick(c) __pv_kick(c)
+#define pv_wait(p, v) __pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt) { }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
* Lock and MCS node addresses hash table for fast lookup
*
* Hashing is done on a per-cacheline basis to minimize the need to access
@@ -103,10 +250,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
+ int hopcnt = 0;

for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
+ pvstat_hop(hopcnt);
return &he->lock;
}
}
@@ -201,6 +351,7 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
val = smp_load_acquire(&lock->val.counter);
if (!(val & _Q_LOCKED_MASK)) {
clear_pending_set_locked(lock);
+ pvstat_inc(pvstat_pend_lock);
goto gotlock;
}
}
@@ -210,11 +361,13 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
* so clear the pending bit and fall back to queuing.
*/
clear_pending(lock);
+ pvstat_inc(pvstat_pend_fail);

queue:
return 0;

gotlock:
+ pvstat_inc(pvstat_pend_lock);
return 1;
}

@@ -225,9 +378,10 @@ gotlock:
static void pv_wait_node(struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ int waitcnt = 0;
int loop;

- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
@@ -245,14 +399,20 @@ static void pv_wait_node(struct mcs_spinlock *node)
*/
smp_store_mb(pn->state, vcpu_halted);

- if (!READ_ONCE(node->locked))
+ if (!READ_ONCE(node->locked)) {
+ pvstat_inc(pvstat_wait_node);
+ if (waitcnt)
+ pvstat_inc(pvstat_wait_again);
pv_wait(&pn->state, vcpu_halted);
+ }

/*
* Reset the vCPU state to avoid unncessary CPU kicking
*/
WRITE_ONCE(pn->state, vcpu_running);

+ if (READ_ONCE(node->locked))
+ break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -260,6 +420,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
+ pvstat_inc(pvstat_spurious);
}
/*
* By now our node->locked should be 1 and our caller will not actually
@@ -285,8 +446,10 @@ static void pv_kick_node(struct mcs_spinlock *node)
*
* See the comment in pv_wait_node().
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
+ if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
+ pvstat_inc(pvstat_lock_kick);
pv_kick(pn->cpu);
+ }
}

/*
@@ -298,9 +461,10 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
struct pv_node *pn = (struct pv_node *)node;
struct __qspinlock *l = (void *)lock;
struct qspinlock **lp = NULL;
+ int waitcnt = 0;
int loop;

- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
return;
@@ -328,14 +492,20 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pvstat_inc(pvstat_wait_head);
+ if (waitcnt)
+ pvstat_inc(pvstat_wait_again);
pv_wait(&l->locked, _Q_SLOW_VAL);

+ if (!READ_ONCE(l->locked))
+ return;
/*
* The unlocker should have freed the lock before kicking the
* CPU. So if the lock is still not free, it is a spurious
* wakeup and so the vCPU should wait again after spinning for
* a while.
*/
+ pvstat_inc(pvstat_spurious);
}

/*
@@ -394,6 +564,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* memory barrier here and in the corresponding pv_wait_head()
* function or we do an unconditional kick which is what is done here.
*/
+ pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
}
/*
--
1.7.1

2015-08-01 02:22:54

by Waiman Long

Subject: [PATCH v4 4/7] locking/pvqspinlock, x86: Optimize PV unlock code path

The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:
1) Moving the unlock slowpath code from the fastpath into a separate
__pv_queued_spin_unlock_slowpath() function to make the fastpath
as simple as possible.
2) For x86-64, hand-coding an assembly function to combine the register
saving thunk code with the fastpath code. Only registers that
are used in the fastpath will be saved and restored. If the
fastpath fails, the slowpath function will be called via another
PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
__pv_queued_spin_unlock() code as the thunk saves and restores
only one 32-bit register.

With a 5M-iteration lock/unlock microbenchmark, the table below shows
the execution times before and after the patch with different numbers
of threads in a VM running on a 32-core Westmere-EX box with x86-64
4.2-rc1 based kernels:

Threads   Before patch   After patch   % Change
-------   ------------   -----------   --------
   1        134.1 ms       119.1 ms      -11%
   2         1163 ms         967 ms      -17%
   3         3641 ms        3298 ms      -9.4%
   4         4051 ms        3634 ms     -10.3%

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock_paravirt.h | 59 +++++++++++++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 31 ++++++++++-----
2 files changed, 80 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
#ifndef __ASM_QSPINLOCK_PARAVIRT_H
#define __ASM_QSPINLOCK_PARAVIRT_H

+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock __pv_queued_spin_unlock
+#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the registers saving trunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ * struct __qspinlock *l = (void *)lock;
+ * u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ * if (likely(lockval == _Q_LOCKED_VAL))
+ * return;
+ * pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ * rdi = lock (first argument)
+ * rsi = lockval (second argument)
+ * rdx = internal variable (set to 0)
+ */
+asm(".pushsection .text;"
+ ".globl " PV_UNLOCK ";"
+ ".align 4,0x90;"
+ PV_UNLOCK ": "
+ "push %rdx;"
+ "mov $0x1,%eax;"
+ "xor %edx,%edx;"
+ "lock cmpxchg %dl,(%rdi);"
+ "cmp $0x1,%al;"
+ "jne .slowpath;"
+ "pop %rdx;"
+ "ret;"
+ ".slowpath: "
+ "push %rsi;"
+ "movzbl %al,%esi;"
+ "call " PV_UNLOCK_SLOWPATH ";"
+ "pop %rsi;"
+ "pop %rdx;"
+ "ret;"
+ ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+ ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);

+#endif /* CONFIG_64BIT */
#endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 3552aa9..5efcc65 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -516,23 +516,20 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}

/*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
*/
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
{
struct __qspinlock *l = (void *)lock;
struct pv_node *node;
- u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);

/*
* We must not unlock if SLOW, because in that case we must first
* unhash. Otherwise it would be possible to have multiple @lock
* entries, which would be BAD.
*/
- if (likely(lockval == _Q_LOCKED_VAL))
- return;
-
if (unlikely(lockval != _Q_SLOW_VAL)) {
if (debug_locks_silent)
return;
@@ -567,12 +564,26 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
}
+
/*
* Include the architecture specific callee-save thunk of the
* __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
*/
#include <asm/qspinlock_paravirt.h>

+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+ u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+
+ if (likely(lockval == _Q_LOCKED_VAL))
+ return;
+
+ __pv_queued_spin_unlock_slowpath(lock, lockval);
+}
+#endif /* __pv_queued_spin_unlock */
--
1.7.1

2015-08-01 02:22:55

by Waiman Long

Subject: [PATCH v4 5/7] locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call

Most of the vCPU kicks are done on the locking side, where the new
lock holder wakes up the queue head vCPU to spin on the lock. However,
there are situations where it may be advantageous to defer the vCPU
kicking to when the lock holder releases the lock.

This patch enables the deferment of vCPU kicking to the unlock function
by adding a new vCPU state (vcpu_hashed) to mark the fact that
1) _Q_SLOW_VAL is set in the lock, and
2) the pv_node address is stored in the hash table

This enablement patch, by itself, should not change the performance
of the pvqspinlock code. Actual deferment of vCPU kicks will be added
in a later patch.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 6 +++---
kernel/locking/qspinlock_paravirt.h | 34 ++++++++++++++++++++++++----------
2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6518ee9..94fdd27 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -259,8 +259,8 @@ static __always_inline void set_locked(struct qspinlock *lock)

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

@@ -464,7 +464,7 @@ queue:
cpu_relax();

arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_kick_node(lock, next);

release:
/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 5efcc65..5e140fe 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -33,6 +33,7 @@
enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
+ vcpu_hashed, /* vcpu_halted + node stored in hash table */
};

struct pv_node {
@@ -406,13 +407,17 @@ static void pv_wait_node(struct mcs_spinlock *node)
pv_wait(&pn->state, vcpu_halted);
}

+ if (READ_ONCE(node->locked))
+ break;
+
/*
- * Reset the vCPU state to avoid unncessary CPU kicking
+ * Reset the vCPU state to running to avoid unncessary CPU
+ * kicking unless vcpu_hashed had already been set. In this
+ * case, node->locked should have just been set, and we
+ * aren't going to set state to vcpu_halted again.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ cmpxchg(&pn->state, vcpu_halted, vcpu_running);

- if (READ_ONCE(node->locked))
- break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -431,12 +436,16 @@ static void pv_wait_node(struct mcs_spinlock *node)

/*
* Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * pv_wait_node(). Alternatively, it can also defer the kicking to the
+ * unlock function.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;

+ if (xchg(&pn->state, vcpu_running) != vcpu_halted)
+ return;
+
/*
* Note that because node->locked is already set, this actual
* mcs_spinlock entry could be re-used already.
@@ -446,10 +455,8 @@ static void pv_kick_node(struct mcs_spinlock *node)
*
* See the comment in pv_wait_node().
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted) {
- pvstat_inc(pvstat_lock_kick);
- pv_kick(pn->cpu);
- }
+ pvstat_inc(pvstat_lock_kick);
+ pv_kick(pn->cpu);
}

/*
@@ -471,6 +478,13 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}

+ if (!lp && (xchg(&pn->state, vcpu_hashed) == vcpu_hashed))
+ /*
+ * The hashed table & _Q_SLOW_VAL had been filled
+ * by the lock holder.
+ */
+ lp = (struct qspinlock **)-1;
+
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
/*
--
1.7.1

2015-08-01 02:22:57

by Waiman Long

Subject: [PATCH v4 6/7] locking/pvqspinlock: Allow vCPUs kick-ahead

Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
critical section and block forward progress. This patch implements
a kick-ahead mechanism where the unlocker will kick the queue head
vCPU as well as up to four additional vCPUs next to the queue head
if they were halted. The kicks are done after exiting the critical
section to improve parallelism.

The amount of kick-ahead allowed depends on the number of vCPUs
in the VM guest. Currently it allows up to 1 vCPU kick-ahead per
4 vCPUs available, up to a maximum of PV_KICK_AHEAD_MAX (4). There
are diminishing returns in increasing the maximum value. The current
value of 4 is a compromise that gets a nice performance boost without
penalizing too much the one vCPU that is doing all the kicking.
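
For illustration, the kick-ahead count is computed in the patch below as

    pv_kick_ahead = min(ncpus/4, PV_KICK_AHEAD_MAX);  /* ncpus = num_possible_cpus() */

so, for example, a guest with 8 vCPUs gets 2, one with 16 vCPUs gets 4,
and 32 or 48 vCPUs stay capped at 4, while guests with fewer than 4
vCPUs get 0 (kick-ahead disabled).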

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere              Haswell
Patch          32 vCPUs   48 vCPUs   32 vCPUs   48 vCPUs
-----          --------   --------   --------   --------
Before patch    3m42.3s   10m27.5s    2m00.7s   17m22.2s
After patch     3m02.3s    9m35.9s    1m59.8s   16m57.6s

This improves performance quite substantially on Westmere, but not
so much on Haswell.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 108 ++++++++++++++++++++++++++++++++---
1 files changed, 99 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 5e140fe..c4cc631 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -54,6 +54,7 @@ enum pv_qlock_stat {
pvstat_kick_time,
pvstat_lock_kick,
pvstat_unlock_kick,
+ pvstat_kick_ahead,
pvstat_pend_lock,
pvstat_pend_fail,
pvstat_spurious,
@@ -75,6 +76,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_kick_time] = "kick_time_count",
[pvstat_lock_kick] = "lock_kick_count",
[pvstat_unlock_kick] = "unlock_kick_count",
+ [pvstat_kick_ahead] = "kick_ahead_count",
[pvstat_pend_lock] = "pending_lock_count",
[pvstat_pend_fail] = "pending_fail_count",
[pvstat_spurious] = "spurious_wakeup",
@@ -87,7 +89,8 @@ static atomic_t pvstats[pvstat_num];
* pv_kick_latencies = sum of all pv_kick latencies in ns
* pv_wake_latencies = sum of all wakeup latencies in ns
*
- * Avg kick latency = pv_kick_latencies/(lock_kick_count + unlock_kick_count)
+ * Avg kick latency = pv_kick_latencies/
+ * (lock_kick_count + unlock_kick_count + kick_ahead_count)
* Avg wake latency = pv_wake_latencies/kick_time_count
* Avg # of hops/hash = hash_hops_count/unlock_kick_count
*/
@@ -219,6 +222,18 @@ static struct pv_hash_entry *pv_lock_hash;
static unsigned int pv_lock_hash_bits __read_mostly;

/*
+ * Allow kick-ahead of vCPUs at unlock time
+ *
+ * The pv_kick_ahead value is set by a simple formula that 1 vCPU kick-ahead
+ * is allowed per 4 vCPUs available up to a maximum of PV_KICK_AHEAD_MAX.
+ * There are diminishing returns in increasing PV_KICK_AHEAD_MAX. The current
+ * value of 4 is a good compromise that gives a good performance boost without
+ * penalizing the vCPU that is doing the kicking by too much.
+ */
+#define PV_KICK_AHEAD_MAX 4
+static int pv_kick_ahead __read_mostly;
+
+/*
* Allocate memory for the PV qspinlock hash buckets
*
* This function should be called from the paravirt spinlock initialization
@@ -226,7 +241,8 @@ static unsigned int pv_lock_hash_bits __read_mostly;
*/
void __init __pv_init_lock_hash(void)
{
- int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
+ int ncpus = num_possible_cpus();
+ int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);

if (pv_hash_size < PV_HE_MIN)
pv_hash_size = PV_HE_MIN;
@@ -240,6 +256,13 @@ void __init __pv_init_lock_hash(void)
pv_hash_size, 0, HASH_EARLY,
&pv_lock_hash_bits, NULL,
pv_hash_size, pv_hash_size);
+ /*
+ * Enable the unlock kick ahead mode according to the number of
+ * vCPUs available.
+ */
+ pv_kick_ahead = min(ncpus/4, PV_KICK_AHEAD_MAX);
+ if (pv_kick_ahead)
+ pr_info("PV unlock kick ahead max count = %d\n", pv_kick_ahead);
}

#define for_each_hash_entry(he, offset, hash) \
@@ -435,6 +458,28 @@ static void pv_wait_node(struct mcs_spinlock *node)
}

/*
+ * Helper to get the address of the next kickable node
+ *
+ * The node has to be in the halted state. If the chkonly flag is set,
+ * the CPU state won't be changed. Otherwise, it will be transitioned to
+ * running state by this function. If no kickable node is found, NULL
+ * will be returned.
+ */
+static inline struct pv_node *
+pv_get_kick_node(struct pv_node *node, int chkonly)
+{
+ struct pv_node *next = (struct pv_node *)READ_ONCE(node->mcs.next);
+
+ if (!next || (READ_ONCE(next->state) != vcpu_halted))
+ return NULL;
+
+ if (!chkonly && (xchg(&next->state, vcpu_running) != vcpu_halted))
+ next = NULL; /* No kicking is needed */
+
+ return next;
+}
+
+/*
* Called after setting next->locked = 1, used to wake those stuck in
* pv_wait_node(). Alternatively, it can also defer the kicking to the
* unlock function.
@@ -447,14 +492,36 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
return;

/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
+ * Kicking the next node at lock time can actually be a bit faster
+ * than doing it at unlock time because the critical section time
+ * overlaps with the wakeup latency of the next node. However, if the
+ * VM is too overcommmitted, it can happen that we need to kick the
+ * CPU again at unlock time (double-kick). To avoid that and also to
+ * fully utilize the kick-ahead functionality at unlock time,
+ * the kicking will be deferred under either one of the following
+ * 2 conditions:
+ * 1) The VM guest has too few vCPUs that kick-ahead is not even
+ * enabled. In this case, the chance of double-kick will be
+ * higher.
+ * 2) The node after the next one is also in the halted state.
*
- * This should be fine however, kicking people for no reason is
- * harmless.
- *
- * See the comment in pv_wait_node().
+ * In this case, the state is set to vcpu_hased to indicate that hash
+ * table has been filled and _Q_SLOW_VAL is set.
*/
+ if (!pv_kick_ahead || pv_get_kick_node(pn, 1)) {
+ if (xchg(&pn->state, vcpu_hashed) != vcpu_hashed) {
+ struct __qspinlock *l = (void *)lock;
+
+ /*
+ * As this is the same vCPU that will check the
+ * _Q_SLOW_VAL value and the hash table later on at
+ * unlock time, no atomic instruction is needed.
+ */
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
+ }
+ return;
+ }
pvstat_inc(pvstat_lock_kick);
pv_kick(pn->cpu);
}
@@ -537,7 +604,8 @@ __visible void
__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
{
struct __qspinlock *l = (void *)lock;
- struct pv_node *node;
+ struct pv_node *node, *next;
+ int i, nr_kick, cpus[PV_KICK_AHEAD_MAX];

/*
* We must not unlock if SLOW, because in that case we must first
@@ -559,6 +627,20 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
node = pv_unhash(lock);

/*
+ * Implement unlock kick-ahead
+ *
+ * Access the next group of nodes, if available, and prepare to kick
+ * them after releasing the lock if they are in the halted state. This
+ * should improve performance on an overcommitted system.
+ */
+ for (nr_kick = 0, next = node; nr_kick < pv_kick_ahead; nr_kick++) {
+ next = pv_get_kick_node(next, 0);
+ if (!next)
+ break;
+ cpus[nr_kick] = next->cpu;
+ }
+
+ /*
* Now that we have a reference to the (likely) blocked pv_node,
* release the lock.
*/
@@ -577,6 +659,14 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
*/
pvstat_inc(pvstat_unlock_kick);
pv_kick(node->cpu);
+
+ /*
+ * Kick the next group of vCPUs, if available.
+ */
+ for (i = 0; i < nr_kick; i++) {
+ pvstat_inc(pvstat_kick_ahead);
+ pv_kick(cpus[i]);
+ }
}

/*
--
1.7.1

2015-08-01 02:22:58

by Waiman Long

Subject: [PATCH v4 7/7] locking/pvqspinlock: Queue node adaptive spinning

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon, as the earlier ones have to be woken up and take
their turn to get the lock.

Reducing the spinning threshold is found to improve performance in
an overcommitted VM guest, but to decrease performance when there is
no overcommitment.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() earlier if all of the following conditions are true
(a condensed sketch of the check follows the list):

1) the vCPU has not been halted before;
2) the previous vCPU is in the halted state;
3) the current vCPU is at least 2 nodes away from the lock holder;
4) there are a lot of pv_wait() for the current vCPU recently.
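
A condensed sketch of the resulting check (see pv_spin_threshold() in
the diff below for the real code; the names are those introduced by
this patch):

    if (!waitcnt && (node->waithist > PV_WAITHIST_THRESHOLD) &&
        !READ_ONCE(prev->mcs.locked) &&
        (READ_ONCE(prev->state) != vcpu_running))
            loop = QNODE_SPIN_THRESHOLD_SHORT;  /* wait early: spin less */
    else
            loop = QNODE_SPIN_THRESHOLD;        /* normal spin threshold */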

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere              Haswell
Patch          32 vCPUs   48 vCPUs   32 vCPUs   48 vCPUs
-----          --------   --------   --------   --------
Before patch    3m02.3s    9m35.9s    1m59.8s   16m57.6s
After patch     3m06.5s    9m38.7s    2m01.5s    9m42.3s

This patch seemed to cause a little bit of performance degradation
for 32 vCPUs. For 48 vCPUs, there wasn't much change for Westmere,
but there was a pretty big performance jump for Haswell.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 5 +-
kernel/locking/qspinlock_paravirt.h | 132 +++++++++++++++++++++++++++++++++-
2 files changed, 131 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 94fdd27..da39d43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -258,7 +258,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
*/

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
static __always_inline void __pv_kick_node(struct qspinlock *lock,
struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
@@ -415,7 +416,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);

- pv_wait_node(node);
+ pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
}

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index c4cc631..d04911b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,12 +23,47 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

/*
- * Queued Spinlock Spin Threshold
+ * Queued Spinlock Spin Thresholds
*
* The vCPU will spin a relatively short time in pending mode before falling
* back to queuing.
+ *
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will spin less if the following conditions are all true:
+ * 1) vCPU in the previous node is halted && it has not been halted before
+ * 2) it is at least 2 nodes away from the lock holder
+ * 3) there is a lot of pv_wait() in the curent vCPU recently
+ *
+ * The last condition is being monitored by the waithist field in the pv_node
+ * structure which tracks the history of pv_wait() relative to slowpath calls.
+ * Each pv_wait will increment this field by PV_WAIT_INC until it exceeds
+ * PV_WAITHIST_MAX. Each slowpath lock call will decrement it by 1 until it
+ * reaches PV_WAITHIST_MIN. If its value is higher than PV_WAITHIST_THRESHOLD,
+ * the vCPU will spin less. The reason for this adaptive spinning is to try
+ * to enable wait-early when overcommitted which should cause a lot more
+ * pv_wait's, but don't use it when it is not.
+ *
+ * The queue node vCPU will monitor the state of the previous node
+ * periodically to see if there is any change. If the previous node is
+ * found to be halted, it will call pv_wait() immediately when wait_early
+ * mode is enabled as the wakeup latency is pretty high. On the other, it
+ * won't go to the halted state immediately on entry to pv_wait_node() as
+ * the previous node may be being woken up.
+ *
+ * With PV_WAIT_INC set to 2, each pv_wait() while not in wait-early mode
+ * will increment waithist by 1. Each slowpath call without pv_wait() will
+ * decrement waithist by 1. The threshold is set in a way as to not prefer
+ * enabling wait-early.
*/
-#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+#define PENDING_SPIN_THRESHOLD (SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_THRESHOLD SPIN_THRESHOLD
+#define QNODE_SPIN_THRESHOLD_SHORT (QNODE_SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_CHECK_MASK 0xff
+#define PV_WAIT_INC 2
+#define PV_WAITHIST_MIN 1
+#define PV_WAITHIST_MAX 20
+#define PV_WAITHIST_THRESHOLD 15

enum vcpu_state {
vcpu_running = 0,
@@ -42,6 +77,8 @@ struct pv_node {

int cpu;
u8 state;
+ u8 waithist;
+ u8 wait_early;
};

/*
@@ -51,6 +88,7 @@ enum pv_qlock_stat {
pvstat_wait_head,
pvstat_wait_node,
pvstat_wait_again,
+ pvstat_wait_early,
pvstat_kick_time,
pvstat_lock_kick,
pvstat_unlock_kick,
@@ -73,6 +111,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_head] = "wait_head_count",
[pvstat_wait_node] = "wait_node_count",
[pvstat_wait_again] = "wait_again_count",
+ [pvstat_wait_early] = "wait_early_count",
[pvstat_kick_time] = "kick_time_count",
[pvstat_lock_kick] = "lock_kick_count",
[pvstat_unlock_kick] = "unlock_kick_count",
@@ -321,6 +360,86 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
}

/*
+ * Helper functions to maintain the waithist field as well as returning the
+ * right spinning threshold and next spin count for queue nodes.
+ */
+static inline void pv_inc_waithist(struct pv_node *node)
+{
+ u8 wait_early = node->wait_early;
+
+ if (wait_early)
+ pvstat_inc(pvstat_wait_early);
+
+ /*
+ * pv_wait() in wait_early mode doesn't count as much as !wait_early
+ */
+ if (node->waithist < PV_WAITHIST_MAX)
+ node->waithist += wait_early ? 1 : PV_WAIT_INC;
+}
+
+static inline void pv_dec_waithist(struct pv_node *node)
+{
+ node->wait_early = 0;
+ if (node->waithist > PV_WAITHIST_MIN)
+ node->waithist--;
+}
+
+static inline int pv_spin_threshold(struct pv_node *node,
+ struct pv_node *prev, int waitcnt)
+{
+ node->wait_early = 0;
+
+ /*
+ * Don't wait early if node has been halted before or the waithist
+ * threshold has not been reached.
+ */
+ if (waitcnt || (node->waithist <= PV_WAITHIST_THRESHOLD))
+ return QNODE_SPIN_THRESHOLD;
+
+ /*
+ * Don't wait early if previous node is a queue head or is running
+ */
+ node->wait_early = !READ_ONCE(prev->mcs.locked) &&
+ (READ_ONCE(prev->state) != vcpu_running);
+
+ return node->wait_early ? QNODE_SPIN_THRESHOLD_SHORT
+ : QNODE_SPIN_THRESHOLD;
+}
+
+static inline int pv_next_spin_count(struct pv_node *node, struct pv_node *prev,
+ int waitcnt, int loop)
+{
+ int wait_early;
+
+ /*
+ * Don't need to make any check if
+ * 1) node has been halted before
+ * 2) it is not time to check yet
+ * 3) wait early is not enabled
+ */
+ if (waitcnt || ((loop & QNODE_SPIN_CHECK_MASK) != 1) ||
+ (node->waithist <= PV_WAITHIST_THRESHOLD))
+ return loop - 1;
+
+ /*
+ * Look for state transition at previous node.
+ *
+ * running => halted:
+ * call pv_wait() now
+ *
+ * halted => running:
+ * reset spin threshold to QNODE_SPIN_THRESHOLD
+ */
+ wait_early = !READ_ONCE(prev->mcs.locked) &&
+ (READ_ONCE(prev->state) != vcpu_running);
+ if (node->wait_early == wait_early)
+ return loop - 1;
+
+ node->wait_early = wait_early;
+ return wait_early ? 0 : QNODE_SPIN_THRESHOLD;
+}
+
+/*
* Initialize the PV part of the mcs_spinlock node.
*/
static void pv_init_node(struct mcs_spinlock *node)
@@ -399,14 +518,16 @@ gotlock:
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_kick_node() is used to wake the vcpu again.
*/
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct pv_node *pp = (struct pv_node *)prev;
int waitcnt = 0;
int loop;

for (;; waitcnt++) {
- for (loop = SPIN_THRESHOLD; loop; loop--) {
+ for (loop = pv_spin_threshold(pn, pp, waitcnt); loop;
+ loop = pv_next_spin_count(pn, pp, waitcnt, loop)) {
if (READ_ONCE(node->locked))
return;
cpu_relax();
@@ -424,6 +545,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
smp_store_mb(pn->state, vcpu_halted);

if (!READ_ONCE(node->locked)) {
+ pv_inc_waithist(pn);
pvstat_inc(pvstat_wait_node);
if (waitcnt)
pvstat_inc(pvstat_wait_again);
@@ -538,6 +660,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
int waitcnt = 0;
int loop;

+ pv_dec_waithist(pn); /* Pre-decremnt the waithist field */
for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -573,6 +696,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pv_inc_waithist(pn);
pvstat_inc(pvstat_wait_head);
if (waitcnt)
pvstat_inc(pvstat_wait_again);
--
1.7.1

2015-08-01 18:01:33

by Davidlohr Bueso

Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Fri, 2015-07-31 at 22:21 -0400, Waiman Long wrote:
> The smp_store_release() is not a full barrier. In order to avoid missed
> wakeup, we may need to add memory barrier around locked and cpu state
> variables adding to complexity. As the chance of spurious wakeup is very
> low, it is easier and safer to just do an unconditional kick at unlock
> time.
>
> Signed-off-by: Waiman Long <[email protected]>

Please keep tags from previous versions ;)

2015-08-01 20:16:09

by Waiman Long

Subject: RE: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

Davidlohr,

I am sorry that I forgot to put in your tag.

Cheers,
Longman


2015-08-01 22:29:15

by Peter Zijlstra

Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Fri, Jul 31, 2015 at 10:21:58PM -0400, Waiman Long wrote:
> The smp_store_release() is not a full barrier. In order to avoid missed
> wakeup, we may need to add memory barrier around locked and cpu state
> variables adding to complexity. As the chance of spurious wakeup is very
> low, it is easier and safer to just do an unconditional kick at unlock
> time.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/locking/qspinlock_paravirt.h | 11 ++++++++---
> 1 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index 15d3733..2dd4b39 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -240,7 +240,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> cpu_relax();
> }
>
> - WRITE_ONCE(pn->state, vcpu_halted);
> if (!lp) { /* ONCE */
> lp = pv_hash(lock, pn);
> /*
> @@ -320,9 +319,15 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
> /*
> * At this point the memory pointed at by lock can be freed/reused,
> * however we can still use the pv_node to kick the CPU.
> + *
> + * As smp_store_release() is not a full barrier, adding a check to
> + * the node->state doesn't guarantee the checking is really done
> + * after clearing the lock byte

This is true, but _WHY_ is that a problem ?

since they are in 2 separate
> + * cachelines and so hardware can reorder them.

That's just gibberish, even in the same cacheline stuff can get
reordered.

So either we insert
> + * memory barrier here and in the corresponding pv_wait_head()
> + * function or we do an unconditional kick which is what is done here.

why, why why ? You've added words, but you've not actually described
what the problem is you're trying to fix.

AFAICT the only thing we really care about here is that the load in
question happens _after_ we observe SLOW, and that is still true.

The order against the unlock is irrelevant.

So we set ->state before we hash and before we set SLOW. Given that
we've seen SLOW, we must therefore also see ->state.

If ->state == halted, this means the CPU in question is blocked and the
pv_node will not get re-used -- if it does get re-used, it wasn't
blocked and we don't care either.

Therefore, ->cpu is stable and we'll kick it into action.
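
In ordering terms, a minimal sketch of the above:

    waiter (pv_wait_head)              unlocker
    ---------------------              --------
    ->state = vcpu_halted
    pv_hash(lock, node)
    l->locked = _Q_SLOW_VAL
                                       cmpxchg() observes _Q_SLOW_VAL,
                                       so ->state and the hash entry are
                                       already visible and a blocked
                                       vCPU gets its conditional kick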

How do you end up not waking a waiting cpu? Explain that.

> */
> - if (READ_ONCE(node->state) == vcpu_halted)
> - pv_kick(node->cpu);
> + pv_kick(node->cpu);
> }

Also, this patch clearly isn't against my tree.

2015-08-03 18:22:24

by Davidlohr Bueso

Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Sun, 2015-08-02 at 00:29 +0200, Peter Zijlstra wrote:
> That's just gibberish, even in the same cacheline stuff can get
> reordered.

true dat

>
> So either we insert
> > + * memory barrier here and in the corresponding pv_wait_head()
> > + * function or we do an unconditional kick which is what is done here.
>
> why, why why ? You've added words, but you've not actually described
> what the problem is you're trying to fix.
>
> AFAICT the only thing we really care about here is that the load in
> question happens _after_ we observe SLOW, and that is still true.
>
> The order against the unlock is irrelevant.
>
> So we set ->state before we hash and before we set SLOW. Given that
> we've seen SLOW, we must therefore also see ->state.
>
> If ->state == halted, this means the CPU in question is blocked and the
> pv_node will not get re-used -- if it does get re-used, it wasn't
> blocked and we don't care either.

Right, if it does get re-used, we were burning SPIN_THRESHOLD and racing
only wastes a few spins, afaict. In fact this is explicitly stated:

/*
* The unlocker should have freed the lock before kicking the
* CPU. So if the lock is still not free, it is a spurious
* wakeup and so the vCPU should wait again after spinning for
* a while.
*/

The thing I like about this patch is that it simplifies the
pv_kick/pv_wait flow, not having to depend on minutia like ->state
checking. But the condition about spurious wakeups is already there, so
really nothing changes.

Thanks,
Davidlohr

2015-08-03 18:37:16

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] locking/pvqspinlock: Add pending bit support

On Fri, 2015-07-31 at 22:21 -0400, Waiman Long wrote:
> /*
> + * Try to acquire the lock and wait using the pending bit
> + */
> +static int pv_pending_lock(struct qspinlock *lock, u32 val)

Sorry, but why did you not rewrite the function as we had previously
discussed? This is very confusing to read; the one I suggested follows a
much nicer flow and purposely illustrates the intention. You also failed
to address my loop semantics concerns altogether.

Thanks,
Davidlohr

2015-08-03 18:37:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Mon, Aug 03, 2015 at 11:22:09AM -0700, Davidlohr Bueso wrote:
> On Sun, 2015-08-02 at 00:29 +0200, Peter Zijlstra wrote:
> > That's just gibberish, even in the same cacheline stuff can get
> > reordered.
>
> true dat
>
> >
> > So either we insert
> > > + * memory barrier here and in the corresponding pv_wait_head()
> > > + * function or we do an unconditional kick which is what is done here.
> >
> > why, why why ? You've added words, but you've not actually described
> > what the problem is you're trying to fix.
> >
> > AFAICT the only thing we really care about here is that the load in
> > question happens _after_ we observe SLOW, and that is still true.
> >
> > The order against the unlock is irrelevant.
> >
> > So we set ->state before we hash and before we set SLOW. Given that
> > we've seen SLOW, we must therefore also see ->state.
> >
> > If ->state == halted, this means the CPU in question is blocked and the
> > pv_node will not get re-used -- if it does get re-used, it wasn't
> > blocked and we don't care either.
>
> Right, if it does get re-used, we were burning SPIN_THRESHOLD and racing
> only wastes a few spins, afaict. In fact this is explicitly stated:
>
> /*
> * The unlocker should have freed the lock before kicking the
> * CPU. So if the lock is still not free, it is a spurious
> * wakeup and so the vCPU should wait again after spinning for
> * a while.
> */
>
> The thing I like about this patch is that it simplifies the
> pv_kick/pv_wait flow, not having to depend on minutia like ->state
> checking. But the condition about spurious wakeups is already there, so
> really nothing changes.

OK, so there's no 'fix'? The patch claims we can lose a wakeup and I
just don't see how that is true.

2015-08-03 19:09:56

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Mon, 2015-08-03 at 20:37 +0200, Peter Zijlstra wrote:
> OK, so there's no 'fix'? The patch claims we can lose a wakeup and I
> just don't see how that is true.

Taking another look, I think you could hit something like this:

CPU0 (lock):                              CPU1 (unlock):
pv_wait_head                              __pv_queued_spin_unlock
                                            <load ->state> [bogus ->state != halted]
<spin>                                    smp_store_release(&l->locked, 0);

WRITE_ONCE(pn->state, vcpu_halted);
pv_wait(&l->locked, _Q_SLOW_VAL);         if (->state == vcpu_halted)
                                              pv_kick(node->cpu); <-- missing wakeup, never called

So basically you can miss a wakeup if the node->state load is done while the
locking thread is spinning and hasn't gotten a chance to update the
state to halted. That would also imply that it occurs right when the
threshold limit is about to be reached.

2015-08-03 19:56:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On Mon, Aug 03, 2015 at 12:09:42PM -0700, Davidlohr Bueso wrote:
> On Mon, 2015-08-03 at 20:37 +0200, Peter Zijlstra wrote:
> > OK, so there's no 'fix'? The patch claims we can lose a wakeup and I
> > just don't see how that is true.
>
> Taking another look, I think you could hit something like this:
>
> CPU0 (lock):                              CPU1 (unlock):
> pv_wait_head                              __pv_queued_spin_unlock
>                                             <load ->state> [bogus ->state != halted]

I don't think this can happen, see below, IF you take the slow path, you
_must_ see halted.

> <spin>                                    smp_store_release(&l->locked, 0);
>
> WRITE_ONCE(pn->state, vcpu_halted);
> pv_wait(&l->locked, _Q_SLOW_VAL);         if (->state == vcpu_halted)
>                                               pv_kick(node->cpu); <-- missing wakeup, never called
>
> So basically you can miss a wakeup if the node->state load is done while the
> locking thread is spinning and hasn't gotten a chance to update the
> state to halted. That would also imply that it occurs right when the
> threshold limit is about to be reached.

pv_wait_head()                          __pv_queued_spin_unlock()

[S] node->state = halted
[S] hash(lock, node)
    MB
[S] ->locked = SLOW
                                            MB
                                        [L] ->locked == SLOW
                                            RMB
                                        [L] node = unhash(lock)
                                        [L] node->state == halted
                                            RELEASE
                                        [S] ->locked = 0

                                        kick(node->cpu)
    CLI
[L] ->locked

If we don't see SLOW, nothing to be done. If we do, we _must_ then also
see ->state == halted and will kick.

And note that the load of node->state _cannot_ be pushed up further: it
depends on the load of node, which in turn depends on the load of
->locked.
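
For concreteness, the labels map roughly onto the current unlock slow path
as follows (a simplified sketch, not verbatim source):

__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;
	struct pv_node *node;

	/* Fast path; this RmW is the MB in the diagram above. */
	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
		return;

	/* [L] ->locked == SLOW -- we are on the slow path. */
	node = pv_unhash(lock);			/* [L] node = unhash(lock) */

	/* RELEASE; [S] ->locked = 0. The lock word may now be freed/reused. */
	smp_store_release(&l->locked, 0);

	/*
	 * [L] node->state == halted. This load may be performed before the
	 * release store above, but it depends on the load of node, which in
	 * turn depends on having observed SLOW, so it cannot be hoisted
	 * above those.
	 */
	if (READ_ONCE(node->state) == vcpu_halted)
		pv_kick(node->cpu);		/* kick(node->cpu) */
}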

So I'm still not seeing it. You cannot miss a kick.

2015-08-03 21:30:20

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] locking/pvqspinlock: Add pending bit support

On 08/03/2015 02:37 PM, Davidlohr Bueso wrote:
> On Fri, 2015-07-31 at 22:21 -0400, Waiman Long wrote:
>> /*
>> + * Try to acquire the lock and wait using the pending bit
>> + */
>> +static int pv_pending_lock(struct qspinlock *lock, u32 val)
> Sorry, but why did you not rewrite the function as we had previously
> discussed? This is very confusing to read; the one I suggested follows a
> much nicer flow and purposely illustrates the intention. You also failed
> to address my loop semantics concerns altogether.
>
> Thanks,
> Davidlohr
>

I am sorry that I might have misinterpreted what you wanted. Right now,
the latest code has 3 loops in the pending function:
1. Waiting for pending locker to become lock holder
2. A loop to do the trylock or set the pending bit.
3. With the pending bit set, another loop to wait until the lock holder
frees the lock.

The 2nd loop may be a bit confusing to look at. I will try to add more
comments to clarify that. The 2nd loop can return without calling
clear_pending() because the pending bit will not be set until it breaks
out of the loop.

I think it will make the code more complicated if we try to merge the
2nd and 3rd loops.
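
To make the structure easier to discuss, here is a rough sketch of how the
three loops fit together (illustration only -- the helper names, spin
thresholds and exact atomic operations of the real patch are simplified or
omitted):

static int pv_pending_lock(struct qspinlock *lock, u32 val)
{
	/* Someone is already queuing; don't bother with the pending bit. */
	if (val & ~(_Q_LOCKED_MASK | _Q_PENDING_MASK))
		return 0;

	/* Loop 1: wait for a pending locker to become the lock holder. */
	while (val & _Q_PENDING_MASK) {
		cpu_relax();
		val = atomic_read(&lock->val);
	}

	/*
	 * Loop 2: trylock, or set the pending bit. The pending bit is only
	 * owned once this loop breaks out, so returning 0 here to fall back
	 * to queuing needs no clear_pending().
	 */
	for (;;) {
		if (!(val & (_Q_LOCKED_MASK | _Q_PENDING_MASK))) {
			if (atomic_cmpxchg(&lock->val, val,
					   val | _Q_LOCKED_VAL) == val)
				return 1;	/* got the lock */
		} else if (!(val & _Q_PENDING_MASK)) {
			if (atomic_cmpxchg(&lock->val, val,
					   val | _Q_PENDING_VAL) == val)
				break;		/* pending bit is now ours */
		} else {
			return 0;		/* give up, queue instead */
		}
		val = atomic_read(&lock->val);
	}

	/*
	 * Loop 3: with the pending bit set, wait for the lock holder to free
	 * the lock, then become the lock holder and clear the pending bit.
	 */
	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
		cpu_relax();
	clear_pending_set_locked(lock);
	return 1;
}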

Please let me know what kind of rewriting you have in mind.

Thanks,
Longman



2015-08-04 03:26:10

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL

On 08/01/2015 06:29 PM, Peter Zijlstra wrote:
> On Fri, Jul 31, 2015 at 10:21:58PM -0400, Waiman Long wrote:
>> The smp_store_release() is not a full barrier. In order to avoid missed
>> wakeup, we may need to add memory barrier around locked and cpu state
>> variables adding to complexity. As the chance of spurious wakeup is very
>> low, it is easier and safer to just do an unconditional kick at unlock
>> time.
>>
>> Signed-off-by: Waiman Long<[email protected]>
>> ---
>> kernel/locking/qspinlock_paravirt.h | 11 ++++++++---
>> 1 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>> index 15d3733..2dd4b39 100644
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -240,7 +240,6 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> cpu_relax();
>> }
>>
>> - WRITE_ONCE(pn->state, vcpu_halted);
>> if (!lp) { /* ONCE */
>> lp = pv_hash(lock, pn);
>> /*
>> @@ -320,9 +319,15 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
>> /*
>> * At this point the memory pointed at by lock can be freed/reused,
>> * however we can still use the pv_node to kick the CPU.
>> + *
>> + * As smp_store_release() is not a full barrier, adding a check to
>> + * the node->state doesn't guarantee the checking is really done
>> + * after clearing the lock byte
> This is true, but _WHY_ is that a problem ?
>
> since they are in 2 separate
>> + * cachelines and so hardware can reorder them.
> That's just gibberish, even in the same cacheline stuff can get
> reordered.
>
> So either we insert
>> + * memory barrier here and in the corresponding pv_wait_head()
>> + * function or we do an unconditional kick which is what is done here.
> why, why why ? You've added words, but you've not actually described
> what the problem is you're trying to fix.
>
> AFAICT the only thing we really care about here is that the load in
> question happens _after_ we observe SLOW, and that is still true.
>
> The order against the unlock is irrelevant.
>
> So we set ->state before we hash and before we set SLOW. Given that
> we've seen SLOW, we must therefore also see ->state.
>
> If ->state == halted, this means the CPU in question is blocked and the
> pv_node will not get re-used -- if it does get re-used, it wasn't
> blocked and we don't care either.
>
> Therefore, ->cpu is stable and we'll kick it into action.
>
> How do you end up not waking a waiting cpu? Explain that.
>

Yes, it is safe in the current code. In some versions of my pvqspinlock
patch, I was resetting the state back to running in pv_wait_head(). That
caused a race problem.

The current code, however, does not reset the state back to running, so
the check is redundant. I will clarify that in the next patch.

>> */
>> - if (READ_ONCE(node->state) == vcpu_halted)
>> - pv_kick(node->cpu);
>> + pv_kick(node->cpu);
>> }
> Also, this patch clearly isn't against my tree.
>

Yes, I was basing it against the latest tip tree. As some of the files in
the patch were modified in the latest tip tree, I will rebase my patch
and update it.

Please let me know if I should be using your tree instead.

Cheers,
Longman