2015-07-11 20:37:20

by Waiman Long

Subject: [PATCH 0/7] locking/qspinlock: Enhance pvqspinlock & introduce queued unfair lock

This patchset consists of two parts:

1) Patches 1-5 enhance the performance of PV qspinlock, especially for
overcommitted guests. The first patch moves all the CPU kicking to
the unlock code. The 2nd and 3rd patches implement a kick-ahead
and wait-early mechanism that was shown to improve performance for
overcommitted guests. They are inspired by the "Do Virtual Machines
Really Scale?" blog from Sanidhya Kashyap. The 4th patch adds
code to collect PV qspinlock statistics. The last patch adds
pending bit support to PV qspinlock to improve performance at
light load. This is important as the PV queuing code has even
higher overhead than the native queuing code.

2) Patch 6 introduces a queued unfair lock as a replacement for the
existing unfair byte lock. The queued unfair lock is fairer
than the byte lock currently used by the qspinlock code while
improving performance at high contention levels. Patch 7 adds
statistics collection to the queued unfair lock code.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system, so both systems have 32 physical CPUs. VM guests
(no NUMA pinning) were set up with 32, 48 and 60 vCPUs. The kernel
build times (make -j <n>, where <n> was the number of vCPUs) on
the various configurations were as follows:

Westmere-EX (8x4):

Kernel                   32 vCPUs    48 vCPUs    60 vCPUs
------                   --------    --------    --------
pvticketlock (4.1.1)      5m02.0s    13m27.6s    15m49.9s
pvqspinlock (4.2-rc1)     3m39.9s    11m17.8s    12m19.9s
patched pvqspinlock       3m38.5s     9m27.8s     9m39.4s
unfair byte lock          4m23.8s     7m14.7s     8m50.4s
unfair queued lock        3m03.4s     3m29.7s     4m15.4s

Haswell-EX (4x8):

Kernel                   32 vCPUs    48 vCPUs    60 vCPUs
------                   --------    --------    --------
pvticketlock (4.1.1)      1m58.9s    18m57.0s    20m46.1s
pvqspinlock (4.2-rc1)     1m59.9s    18m44.2s    18m57.0s
patched pvqspinlock       2m01.7s     8m03.7s     8m29.5s
unfair byte lock          2m04.5s     2m46.7s     3m15.6s
unfair queued lock        1m59.4s     2m04.9s     2m18.6s

It can be seen that the queued unfair lock has the best performance in
almost all the cases. As can be seen in patch 4, the overhead of PV
kicking and waiting is quite high. Unfair locks avoid that overhead
and spend the time on productive work instead. On the other hand,
the pvqspinlock is fair while the byte lock is not. The queued unfair
lock sits somewhere in between: it is not as fair as the pvqspinlock,
but it is fairer than the byte lock.

Looking at the PV locks, the patched pvqspinlock did increase performance
in the overcommitted guests by about 20% on Westmere-EX and by more than
2X on Haswell-EX. More investigation may be needed to find out why
there was a slowdown on Haswell-EX compared with Westmere-EX.

In conclusion, an unfair lock actually performs better when a VM
guest is over-committed. If there is no over-commitment, PV locks
work fine, too.

When the VM guest was changed to be NUMA pinned (direct mapping between
physical and virtual CPUs) on the Westmere-EX system, the build
times became:

Kernel                   32 vCPUs
------                   --------
pvticketlock (4.1.1)      2m47.1s
pvqspinlock (4.2-rc1)     2m45.9s
patched pvqspinlock       2m45.2s
unfair byte lock          2m45.4s
unfair queued lock        2m44.9s

It can be seen that the build times are virtually the same for all
the configurations.

Waiman Long (7):
locking/pvqspinlock: Only kick CPU at unlock time
locking/pvqspinlock: Allow vCPUs kick-ahead
locking/pvqspinlock: Implement wait-early for overcommitted guest
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock: Add pending bit support
locking/qspinlock: A fairer queued unfair lock
locking/qspinlock: Collect queued unfair lock slowpath statistics

arch/x86/Kconfig | 8 +
arch/x86/include/asm/qspinlock.h | 17 +-
kernel/locking/qspinlock.c | 140 ++++++++++-
kernel/locking/qspinlock_paravirt.h | 436 ++++++++++++++++++++++++++++++++---
kernel/locking/qspinlock_unfair.h | 327 ++++++++++++++++++++++++++
5 files changed, 880 insertions(+), 48 deletions(-)
create mode 100644 kernel/locking/qspinlock_unfair.h


2015-07-11 20:37:21

by Waiman Long

Subject: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes the queue head and once
more before it gets the lock. All this CPU kicking and halting
(VMEXIT) can be expensive and slow down system performance.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only, to differentiate other
queue nodes from the queue head.
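
To summarize, the intended state transitions after this patch are as
follows (a reviewer's sketch, not part of the patch itself; all names
are the ones used in the code below):

enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

/*
 * pv_wait_node():  running -> halted   by a queued waiter before pv_wait()
 *                  halted  -> running  by that waiter after wakeup (cmpxchg,
 *                                      so a vcpu_hashed value is preserved)
 * pv_scan_next():  halted  -> hashed   by the new lock holder, which also
 *                                      sets _Q_SLOW_VAL and hashes the lock
 * pv_wait_head():  running -> hashed   by the queue head itself before it
 *                                      calls pv_wait() on the lock word
 * unlock:          the CPU is kicked only if its state is vcpu_hashed
 */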

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 10 ++--
kernel/locking/qspinlock_paravirt.h | 83 ++++++++++++++++++++++++++---------
2 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..d2e0fc1 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock *lock)

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_scan_next(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

@@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,

#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
-#define pv_kick_node __pv_kick_node
+#define pv_scan_next __pv_scan_next
#define pv_wait_head __pv_wait_head

#ifdef CONFIG_PARAVIRT_SPINLOCKS
@@ -440,7 +440,7 @@ queue:
cpu_relax();

arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_scan_next(lock, next);

release:
/*
@@ -461,7 +461,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);

#undef pv_init_node
#undef pv_wait_node
-#undef pv_kick_node
+#undef pv_scan_next
#undef pv_wait_head

#undef queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 04ab181..d302c39 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -21,9 +21,14 @@

#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

+/*
+ * Queue node uses: vcpu_running & vcpu_halted.
+ * Queue head uses: vcpu_running & vcpu_hashed.
+ */
enum vcpu_state {
vcpu_running = 0,
- vcpu_halted,
+ vcpu_halted, /* Used only in pv_wait_node */
+ vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
};

struct pv_node {
@@ -152,7 +157,8 @@ static void pv_init_node(struct mcs_spinlock *node)

/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
*/
static void pv_wait_node(struct mcs_spinlock *node)
{
@@ -171,9 +177,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
*
* [S] pn->state = vcpu_halted [S] next->locked = 1
* MB MB
- * [L] pn->locked [RmW] pn->state = vcpu_running
+ * [L] pn->locked [RmW] pn->state = vcpu_hashed
*
- * Matches the xchg() from pv_kick_node().
+ * Matches the cmpxchg() from pv_scan_next().
*/
smp_store_mb(pn->state, vcpu_halted);

@@ -181,9 +187,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
pv_wait(&pn->state, vcpu_halted);

/*
- * Reset the vCPU state to avoid unncessary CPU kicking
+ * Reset the state except when vcpu_hashed is set.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ cmpxchg(&pn->state, vcpu_halted, vcpu_running);

/*
* If the locked flag is still not set after wakeup, it is a
@@ -193,6 +199,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* MCS lock will be released soon.
*/
}
+
/*
* By now our node->locked should be 1 and our caller will not actually
* spin-wait for it. We do however rely on our caller to do a
@@ -201,24 +208,32 @@ static void pv_wait_node(struct mcs_spinlock *node)
}

/*
- * Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * Called after setting next->locked = 1 & lock acquired.
+ * Check if the vCPU has been halted. If so, set the _Q_SLOW_VAL flag
+ * and put an entry into the lock hash table to be woken up at unlock time.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct __qspinlock *l = (void *)lock;

/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
- *
- * This should be fine however, kicking people for no reason is
- * harmless.
+ * Transition vCPU state: halted => hashed
+ * Quit if the transition failed.
*
- * See the comment in pv_wait_node().
+ * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
- pv_kick(pn->cpu);
+ if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ return;
+
+ /*
+ * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
+ * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+ * the hash table later on at unlock time, no atomic instruction is
+ * needed.
+ */
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
}

/*
@@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
struct __qspinlock *l = (void *)lock;
- struct qspinlock **lp = NULL;
+ struct qspinlock **lp;
int loop;

+ /*
+ * Initialize lp to a non-NULL value if it has already been in the
+ * pv_hashed state so that pv_hash() won't be called again.
+ */
+ lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
+ : NULL;
for (;;) {
+ WRITE_ONCE(pn->state, vcpu_running);
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
return;
cpu_relax();
}

- WRITE_ONCE(pn->state, vcpu_halted);
+ /*
+ * Recheck lock value after setting vcpu_hashed state
+ *
+ * [S] state = vcpu_hashed [S] l->locked = 0
+ * MB MB
+ * [L] l->locked [L] state == vcpu_hashed
+ *
+ * Matches smp_store_mb() in __pv_queued_spin_unlock()
+ */
+ smp_store_mb(pn->state, vcpu_hashed);
+
+ if (!READ_ONCE(l->locked)) {
+ WRITE_ONCE(pn->state, vcpu_running);
+ return;
+ }
+
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
+
/*
* lp must be set before setting _Q_SLOW_VAL
*
@@ -305,13 +343,16 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* Now that we have a reference to the (likely) blocked pv_node,
* release the lock.
*/
- smp_store_release(&l->locked, 0);
+ smp_store_mb(l->locked, 0);

/*
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
+ * The other vCPU may not really be halted, but kicking an active
+ * vCPU is harmless other than the additional latency in completing
+ * the unlock.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
+ if (READ_ONCE(node->state) == vcpu_hashed)
pv_kick(node->cpu);
}
/*
--
1.7.1

2015-07-11 20:38:46

by Waiman Long

Subject: [PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead

Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen
critical sections and block forward progress. This patch implements
a kick-ahead mechanism where the unlocker will kick the queue head
vCPU as well as a few additional vCPUs (at most PV_KICK_AHEAD_MAX)
next to the queue head if they were halted. The kicks are done after
exiting the critical section to improve parallelism.

The amount of kick-ahead allowed depends on the number of vCPUs in
the VM guest. This change should improve overall system performance
in a busy overcommitted guest.
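
As a reference, here is a minimal user-space sketch (not part of the
patch) of how the kick-ahead level is derived from the guest size,
using the thresholds added in __pv_init_lock_hash() below; the helper
name kick_ahead_level() is made up for illustration:

#include <stdio.h>

#define PV_KICK_AHEAD_MAX	4

/* minimum number of vCPUs required for each kick-ahead level */
static const unsigned char kick_ahead_threshold[PV_KICK_AHEAD_MAX] = {
	4, 8, 16, 32
};

static int kick_ahead_level(int ncpus)
{
	int i;

	for (i = PV_KICK_AHEAD_MAX; i > 0; i--)
		if (ncpus >= kick_ahead_threshold[i - 1])
			return i;
	return 0;
}

int main(void)
{
	printf("%d %d %d %d\n", kick_ahead_level(2), kick_ahead_level(8),
	       kick_ahead_level(48), kick_ahead_level(60));
	return 0;	/* prints 0 2 4 4 */
}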

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_paravirt.h | 71 ++++++++++++++++++++++++++++++++++-
1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index d302c39..4c1a299 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -67,6 +67,12 @@ static struct pv_hash_entry *pv_lock_hash;
static unsigned int pv_lock_hash_bits __read_mostly;

/*
+ * Allow kick-ahead of vCPUs at unlock time
+ */
+#define PV_KICK_AHEAD_MAX 4
+static int pv_kick_ahead __read_mostly;
+
+/*
* Allocate memory for the PV qspinlock hash buckets
*
* This function should be called from the paravirt spinlock initialization
@@ -74,7 +80,16 @@ static unsigned int pv_lock_hash_bits __read_mostly;
*/
void __init __pv_init_lock_hash(void)
{
- int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
+ int ncpus = num_possible_cpus();
+ int pv_hash_size = ALIGN(4 * ncpus, PV_HE_PER_LINE);
+ int i;
+
+ /*
+ * The minimum number of vCPUs required in each kick-ahead level
+ */
+ static const u8 kick_ahead_threshold[PV_KICK_AHEAD_MAX] = {
+ 4, 8, 16, 32
+ };

if (pv_hash_size < PV_HE_MIN)
pv_hash_size = PV_HE_MIN;
@@ -88,6 +103,18 @@ void __init __pv_init_lock_hash(void)
pv_hash_size, 0, HASH_EARLY,
&pv_lock_hash_bits, NULL,
pv_hash_size, pv_hash_size);
+ /*
+ * Enable the unlock kick ahead mode according to the number of
+ * vCPUs available.
+ */
+ for (i = PV_KICK_AHEAD_MAX; i > 0; i--)
+ if (ncpus >= kick_ahead_threshold[i - 1]) {
+ pv_kick_ahead = i;
+ break;
+ }
+ if (pv_kick_ahead)
+ printk(KERN_INFO "PV unlock kick ahead level %d enabled\n",
+ pv_kick_ahead);
}

#define for_each_hash_entry(he, offset, hash) \
@@ -317,13 +344,33 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}

/*
+ * Helper to get the address of the next kickable node
+ * The node has to be in the halted state and is being transitioned to
+ * running state by this function. Otherwise, NULL will be returned.
+ */
+static inline struct pv_node *pv_get_kick_node(struct pv_node *node)
+{
+ struct pv_node *next = (struct pv_node *)READ_ONCE(node->mcs.next);
+
+ if (!next)
+ return NULL;
+
+ if ((READ_ONCE(next->state) != vcpu_halted) ||
+ (cmpxchg(&next->state, vcpu_halted, vcpu_running) != vcpu_halted))
+ next = NULL; /* No kicking is needed */
+
+ return next;
+}
+
+/*
* PV version of the unlock function to be used in stead of
* queued_spin_unlock().
*/
__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
struct __qspinlock *l = (void *)lock;
- struct pv_node *node;
+ struct pv_node *node, *nxt, *next[PV_KICK_AHEAD_MAX];
+ int i, nr_kick;

/*
* We must not unlock if SLOW, because in that case we must first
@@ -340,6 +387,20 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
node = pv_unhash(lock);

/*
+ * Implement kick-ahead mode
+ *
+ * Access the next group of nodes, if available, and prepare to kick
+ * them after releasing the lock if they are in the halted state. This
+ * should improve performance on an overcommitted system.
+ */
+ for (nr_kick = 0, nxt = node; nr_kick < pv_kick_ahead;
+ nxt = next[nr_kick], nr_kick++) {
+ next[nr_kick] = pv_get_kick_node(nxt);
+ if (!next[nr_kick])
+ break;
+ }
+
+ /*
* Now that we have a reference to the (likely) blocked pv_node,
* release the lock.
*/
@@ -354,6 +415,12 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
*/
if (READ_ONCE(node->state) == vcpu_hashed)
pv_kick(node->cpu);
+
+ /*
+ * Kick the next group of vCPUs, if available.
+ */
+ for (i = 0; i < nr_kick; i++)
+ pv_kick(next[i]->cpu);
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1

2015-07-11 20:37:23

by Waiman Long

Subject: [PATCH 3/7] locking/pvqspinlock: Implement wait-early for overcommitted guest

In an overcommitted guest, where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU
later in the spinlock queue will be spinning while the ones earlier in
the queue have already been halted. The spinning in the later
vCPUs is then just a waste of precious CPU cycles because they are
not going to get the lock anytime soon, as the earlier ones have to be
woken up and take their turn to get the lock.

This patch implements a wait-early mechanism where the vCPU will
call pv_wait() earlier if the previous vCPU is already in the halted
state. In that case, it will spin less before calling pv_wait(). On
the other hand, if the previous vCPU was running and then becomes
halted, the current vCPU will call pv_wait() immediately.

This patch also separates the spin threshold for queue head and
queue nodes. It favors the queue head by allowing it to spin longer
before calling pv_wait().
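
For a concrete sense of scale, here is a small user-space sketch (not
part of the patch) of what the new thresholds work out to, assuming the
default SPIN_THRESHOLD of (1 << 15) used by the PV qspinlock code:

#include <stdio.h>

#define SPIN_THRESHOLD			(1 << 15)	/* assumed default */
#define QHEAD_SPIN_THRESHOLD		(SPIN_THRESHOLD + (SPIN_THRESHOLD/4))
#define QNODE_SPIN_THRESHOLD		SPIN_THRESHOLD
#define QNODE_SPIN_THRESHOLD_SHORT	(QNODE_SPIN_THRESHOLD >> 4)
#define QNODE_SPIN_CHECK_MASK		0xff

int main(void)
{
	/*
	 * queue head: 40960 spins, queue node: 32768 spins, queue node
	 * behind an already halted vCPU: 2048 spins, and the previous
	 * node state is rechecked every 256 iterations.
	 */
	printf("%d %d %d %d\n", QHEAD_SPIN_THRESHOLD, QNODE_SPIN_THRESHOLD,
	       QNODE_SPIN_THRESHOLD_SHORT, QNODE_SPIN_CHECK_MASK + 1);
	return 0;
}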

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 5 ++-
kernel/locking/qspinlock_paravirt.h | 52 +++++++++++++++++++++++++++++++++--
2 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d2e0fc1..782bc18 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -238,7 +238,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
*/

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
static __always_inline void __pv_scan_next(struct qspinlock *lock,
struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
@@ -391,7 +392,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);

- pv_wait_node(node);
+ pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
}

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4c1a299..b3fe5bb 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -22,6 +22,26 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

/*
+ * Queued Spinlock Spin Thresholds
+ * -------------------------------
+ * Because of the cacheline contention effect of the ticket spinlock, the
+ * same spin threshold for queued spinlock will run a bit faster. So we set
+ * a slightly larger threshold for the queue head (1.25X) while the other queue
+ * nodes will keep the same threshold.
+ *
+ * A queue node vCPU will spin less if the vCPU in the previous node is halted.
+ * The queue node vCPU will also monitor the state of the previous node
+ * periodically if it is not halted. When the previous node vCPU transitions
+ * from active to halted, the current one will go to halted state too. It is
+ * because it takes quite a lot of cycles for a vCPU to perform vmexit and
+ * vmenter. So it is better for the current vCPU to be halted too.
+ */
+#define QHEAD_SPIN_THRESHOLD (SPIN_THRESHOLD + (SPIN_THRESHOLD/4))
+#define QNODE_SPIN_THRESHOLD SPIN_THRESHOLD
+#define QNODE_SPIN_THRESHOLD_SHORT (QNODE_SPIN_THRESHOLD >> 4)
+#define QNODE_SPIN_CHECK_MASK 0xff
+
+/*
* Queue node uses: vcpu_running & vcpu_halted.
* Queue head uses: vcpu_running & vcpu_hashed.
*/
@@ -187,15 +207,41 @@ static void pv_init_node(struct mcs_spinlock *node)
* pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
* behalf.
*/
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct pv_node *pp = (struct pv_node *)prev;
+ bool prev_halted;
int loop;

for (;;) {
- for (loop = SPIN_THRESHOLD; loop; loop--) {
+ /*
+ * Spin less if the previous vCPU was in the halted state
+ */
+ prev_halted = (READ_ONCE(pp->state) != vcpu_running);
+ loop = prev_halted ? QNODE_SPIN_THRESHOLD_SHORT
+ : QNODE_SPIN_THRESHOLD;
+ while (loop--) {
if (READ_ONCE(node->locked))
return;
+ /*
+ * Look for state transition at previous node.
+ *
+ * running => halted:
+ * call pv_wait() now to halt current vCPU
+ * halted => running:
+ * reset spin threshold to QNODE_SPIN_THRESHOLD
+ */
+ if (!(loop & QNODE_SPIN_CHECK_MASK)) {
+ bool halted = (READ_ONCE(pp->state)
+ != vcpu_running);
+ if (!prev_halted && halted) {
+ break;
+ } else if (prev_halted && !halted) {
+ loop = QNODE_SPIN_THRESHOLD;
+ prev_halted = halted;
+ }
+ }
cpu_relax();
}

@@ -282,7 +328,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
: NULL;
for (;;) {
WRITE_ONCE(pn->state, vcpu_running);
- for (loop = SPIN_THRESHOLD; loop; loop--) {
+ for (loop = QHEAD_SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
return;
cpu_relax();
--
1.7.1

2015-07-11 20:37:26

by Waiman Long

Subject: [PATCH 4/7] locking/pvqspinlock: Collect slowpath lock statistics

This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of kicking and
wakeup latencies which have a heavy dependency on the CPUs being used.

The measured latencies for different CPUs are:

CPU            Wakeup    Kicking
---            ------    -------
Haswell-EX     26.4us     9.2us
Westmere-EX    99.4us    25.5us

So Haswell is much faster than Westmere.

The accumulated lock statistics will be reported in debugfs under the
pv-qspinlock directory.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 7 ++
kernel/locking/qspinlock_paravirt.h | 173 ++++++++++++++++++++++++++++++++++-
2 files changed, 177 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..299a1c4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -679,6 +679,13 @@ config PARAVIRT_SPINLOCKS

If you are unsure how to answer this question, answer Y.

+config QUEUED_LOCK_STAT
+ bool "Paravirt queued lock statistics"
+ depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ ---help---
+ Enable the collection of statistical data on the behavior of
+ paravirtualized queued spinlocks and report them on debugfs.
+
source "arch/x86/xen/Kconfig"

config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index b3fe5bb..efc9a72 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -60,6 +60,155 @@ struct pv_node {
};

/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+ pvstat_wait_head,
+ pvstat_wait_node,
+ pvstat_wait_early,
+ pvstat_kick_wake,
+ pvstat_kick_cpu,
+ pvstat_kick_ahead,
+ pvstat_no_kick,
+ pvstat_spurious,
+ pvstat_hash,
+ pvstat_hops,
+ pvstat_num /* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+ [pvstat_wait_head] = "wait_head_count",
+ [pvstat_wait_node] = "wait_node_count",
+ [pvstat_wait_early] = "wait_early_count",
+ [pvstat_kick_wake] = "kick_wake_count",
+ [pvstat_kick_cpu] = "kick_cpu_count",
+ [pvstat_kick_ahead] = "kick_ahead_count",
+ [pvstat_no_kick] = "no_kick_count",
+ [pvstat_spurious] = "spurious_wakeup",
+ [pvstat_hash] = "hash_count",
+ [pvstat_hops] = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency = pv_kick_latencies/(kick_cpu_count + kick_ahead_count)
+ * Avg wake latency = pv_wake_latencies/kick_wake_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+ struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+ int i;
+
+ if (!d_pvqlock)
+ printk(KERN_WARNING
+ "Could not create 'pv-qspinlock' debugfs directory\n");
+
+ for (i = 0; i < pvstat_num; i++)
+ debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)&pvstats[i]);
+ debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_kick_latencies);
+ debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_wake_latencies);
+ debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+ int i;
+
+ for (i = 0; i < pvstat_num; i++)
+ atomic_set(&pvstats[i], 0);
+ atomic64_set(&pv_kick_latencies, 0);
+ atomic64_set(&pv_wake_latencies, 0);
+ reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+ atomic_inc(&pvstats[stat]);
+ if (unlikely(reset_cnts))
+ pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+ atomic_inc(&pvstats[pvstat_hash]);
+ atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+ u64 start = sched_clock();
+
+ *per_cpu_ptr(&pv_kick_time, cpu) = start;
+ pv_kick(cpu);
+ atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+ u64 kick_time;
+
+ pv_wait(ptr, val);
+ kick_time = *this_cpu_ptr(&pv_kick_time);
+ if (kick_time) {
+ atomic64_add(sched_clock() - kick_time, &pv_wake_latencies);
+ pvstat_inc(pvstat_kick_wake);
+ *this_cpu_ptr(&pv_kick_time) = 0;
+ }
+}
+
+#define pv_kick(c) __pv_kick(c)
+#define pv_wait(p, v) __pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt) { }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
* Lock and MCS node addresses hash table for fast lookup
*
* Hashing is done on a per-cacheline basis to minimize the need to access
@@ -146,10 +295,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
+ int hopcnt = 0;

for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
+ pvstat_hop(hopcnt);
return &he->lock;
}
}
@@ -221,6 +373,8 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
prev_halted = (READ_ONCE(pp->state) != vcpu_running);
loop = prev_halted ? QNODE_SPIN_THRESHOLD_SHORT
: QNODE_SPIN_THRESHOLD;
+ if (prev_halted)
+ pvstat_inc(pvstat_wait_early);
while (loop--) {
if (READ_ONCE(node->locked))
return;
@@ -236,6 +390,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
bool halted = (READ_ONCE(pp->state)
!= vcpu_running);
if (!prev_halted && halted) {
+ pvstat_inc(pvstat_wait_early);
break;
} else if (prev_halted && !halted) {
loop = QNODE_SPIN_THRESHOLD;
@@ -256,14 +411,18 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
*/
smp_store_mb(pn->state, vcpu_halted);

- if (!READ_ONCE(node->locked))
+ if (!READ_ONCE(node->locked)) {
+ pvstat_inc(pvstat_wait_node);
pv_wait(&pn->state, vcpu_halted);
+ }

/*
* Reset the state except when vcpu_hashed is set.
*/
cmpxchg(&pn->state, vcpu_halted, vcpu_running);

+ if (READ_ONCE(node->locked))
+ break;
/*
* If the locked flag is still not set after wakeup, it is a
* spurious wakeup and the vCPU should wait again. However,
@@ -271,6 +430,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
+ pvstat_inc(pvstat_spurious);
}

/*
@@ -372,6 +532,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pvstat_inc(pvstat_wait_head);
pv_wait(&l->locked, _Q_SLOW_VAL);

/*
@@ -459,14 +620,20 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
* vCPU is harmless other than the additional latency in completing
* the unlock.
*/
- if (READ_ONCE(node->state) == vcpu_hashed)
+ if (READ_ONCE(node->state) == vcpu_hashed) {
+ pvstat_inc(pvstat_kick_cpu);
pv_kick(node->cpu);
+ } else {
+ pvstat_inc(pvstat_no_kick);
+ }

/*
* Kick the next group of vCPUs, if available.
*/
- for (i = 0; i < nr_kick; i++)
+ for (i = 0; i < nr_kick; i++) {
+ pvstat_inc(pvstat_kick_ahead);
pv_kick(next[i]->cpu);
+ }
}
/*
* Include the architecture specific callee-save thunk of the
--
1.7.1

2015-07-11 20:37:28

by Waiman Long

Subject: [PATCH 5/7] locking/pvqspinlock: Add pending bit support

Like the native qspinlock, using the pending bit to acquire the lock
when it is lightly loaded is faster than going through the PV queuing
process, which is even slower than the native queuing process. It also
avoids loading two additional cachelines (the MCS and PV nodes).

This patch adds pending bit support to the PV qspinlock. The
pending bit code has a smaller spin threshold. It will fall back
to the queuing method if it cannot acquire the lock within a certain
time limit.
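
As a rough illustration of the added fast path, the sketch below (user
space, not part of the patch; pending_path_action() is a made-up helper)
shows what the trylock/pending loop attempts for a given snapshot of the
lock word, assuming the usual qspinlock layout of a locked byte at bits
0-7 and the pending bit at bit 8:

#include <stdio.h>

#define _Q_LOCKED_VAL	(1U << 0)
#define _Q_PENDING_VAL	(1U << 8)
#define _Q_LOCKED_MASK	0xffU

static const char *pending_path_action(unsigned int val)
{
	if (val & ~_Q_LOCKED_MASK)
		return "queue";				/* pending/tail already set */
	if (val == 0)
		return "cmpxchg 0 -> LOCKED";		/* direct trylock */
	return "cmpxchg LOCKED -> LOCKED|PENDING";	/* become the pending waiter */
}

int main(void)
{
	unsigned int samples[] = {
		0, _Q_LOCKED_VAL, _Q_LOCKED_VAL | _Q_PENDING_VAL, 0x8000
	};
	int i;

	for (i = 0; i < 4; i++)
		printf("val=0x%04x -> %s\n", samples[i],
		       pending_path_action(samples[i]));
	return 0;
}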

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 27 +++++++++++++++-
kernel/locking/qspinlock_paravirt.h | 61 +++++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+), 1 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 782bc18..5a25e89 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -162,6 +162,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}

+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ WRITE_ONCE(l->pending, 0);
+}
+
/*
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
@@ -193,6 +204,15 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
}

/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL, &lock->val);
+}
+
+/**
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
* @tail : The new queue tail code word
@@ -246,6 +266,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

#define pv_enabled() false
+#define pv_pending_lock(l, v) false

#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
@@ -287,8 +308,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

- if (pv_enabled())
+ if (pv_enabled()) {
+ if (pv_pending_lock(lock, val))
+ return; /* Got the lock via pending bit */
goto queue;
+ }

if (virt_queued_spin_lock(lock))
return;
@@ -464,6 +488,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_wait_node
#undef pv_scan_next
#undef pv_wait_head
+#undef pv_pending_lock

#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index efc9a72..d770694 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -40,6 +40,7 @@
#define QNODE_SPIN_THRESHOLD SPIN_THRESHOLD
#define QNODE_SPIN_THRESHOLD_SHORT (QNODE_SPIN_THRESHOLD >> 4)
#define QNODE_SPIN_CHECK_MASK 0xff
+#define PENDING_SPIN_THRESHOLD QNODE_SPIN_THRESHOLD_SHORT

/*
* Queue node uses: vcpu_running & vcpu_halted.
@@ -70,6 +71,8 @@ enum pv_qlock_stat {
pvstat_kick_cpu,
pvstat_kick_ahead,
pvstat_no_kick,
+ pvstat_pend_lock,
+ pvstat_pend_fail,
pvstat_spurious,
pvstat_hash,
pvstat_hops,
@@ -91,6 +94,8 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_kick_cpu] = "kick_cpu_count",
[pvstat_kick_ahead] = "kick_ahead_count",
[pvstat_no_kick] = "no_kick_count",
+ [pvstat_pend_lock] = "pending_lock_count",
+ [pvstat_pend_fail] = "pending_fail_count",
[pvstat_spurious] = "spurious_wakeup",
[pvstat_hash] = "hash_count",
[pvstat_hops] = "hash_hops_count",
@@ -355,6 +360,62 @@ static void pv_init_node(struct mcs_spinlock *node)
}

/*
+ * Try to acquire the lock and wait using the pending bit
+ */
+static int pv_pending_lock(struct qspinlock *lock, u32 val)
+{
+ int loop = PENDING_SPIN_THRESHOLD;
+ u32 new, old;
+
+ /*
+ * wait for in-progress pending->locked hand-overs
+ */
+ if (val == _Q_PENDING_VAL) {
+ while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
+ loop--)
+ cpu_relax();
+ }
+
+ /*
+ * trylock || pending
+ */
+ for (;;) {
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+ if (loop-- <= 0)
+ goto queue;
+ }
+
+ if (new == _Q_LOCKED_VAL)
+ goto gotlock;
+ /*
+ * We are pending, wait for the owner to go away.
+ */
+ while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
+ && (loop-- > 0))
+ cpu_relax();
+ if (!(val & _Q_LOCKED_MASK)) {
+ clear_pending_set_locked(lock);
+ goto gotlock;
+ }
+ clear_pending(lock); /* Clear the pending bit only */
+ pvstat_inc(pvstat_pend_fail);
+
+queue:
+ return 0;
+
+gotlock:
+ pvstat_inc(pvstat_pend_lock);
+ return 1;
+}
+
+/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
* pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
* behalf.
--
1.7.1

2015-07-11 20:38:04

by Waiman Long

Subject: [PATCH 6/7] locking/qspinlock: A fairer queued unfair lock

For a virtual guest with the qspinlock patch, a simple unfair byte lock
will be used if PV spinlock is not configured in or the hypervisor
is neither KVM nor Xen. The byte lock works fine with a small guest
of just a few vCPUs. On a much larger guest, however, the byte lock
can have the following problems:

1) Lock starvation is a real possibility especially if the number
of vCPUs is large.
2) The constant reading and occasionally writing to the lock word can
put a lot of cacheline contention traffic on the affected
cacheline.

This patch introduces a queue-based unfair lock where all the vCPUs on
the queue can opportunistically steal the lock, but the frequency of
doing so decreases the further a vCPU is from the queue head. It
encourages a more FIFO-like order of getting the lock and hence greatly
reduces the chance of lock starvation. It also reduces the cacheline
contention problem and so improves the performance of the system.
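
To make that decay concrete, here is a minimal user-space sketch (not
part of the patch) of how the lock-stealing period grows with queue
position, using the constants introduced in qspinlock_unfair.h below:

#include <stdio.h>

#define LPERIOD_MIN		4
#define LPERIOD_THRESHOLD	256
#define LPERIOD_QHEAD		(LPERIOD_MIN >> 1)
#define LPERIOD_INC		32

int main(void)
{
	int period = LPERIOD_QHEAD;	/* queue head: 2 spins per steal attempt */
	int pos;

	for (pos = 1; pos <= 12; pos++) {
		period = (period < LPERIOD_THRESHOLD) ? period * 2
						      : period + LPERIOD_INC;
		printf("node %2d behind the head: steal attempt every %d spins\n",
		       pos, period);
	}
	return 0;	/* 4, 8, 16, ..., 256, 288, 320, ... */
}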

This patch has no impact on native qspinlock performance at all. The
unfair lock code will only be compiled in if CONFIG_HYPERVISOR_GUEST
is defined.

A microbenchmark running 1 million lock-unlock operations for various
numbers of threads was run on a KVM guest with 32 pinned vCPUs and 4
vCPUs per node (8 nodes). This microbenchmark is intended to measure
the variability of the execution times.

Kernel            Threads      Min/Avg/Max(ms)          SD(ms)
------            -------      ---------------          ------
Unfair byte lock     4       133.1/386.0/509.0          153.48
                     8       720.5/939.5/1,068.0        117.08
                    16     2,237.8/6,045.8/7,550.3     1747.37
                    32     5,880.2/37,028.2/44,668.7   10136.30
Unfair qspinlock     4       326.1/453.7/523.0           80.44
                     8       681.6/1,126.4/1,486.5      304.85
                    16     1,543.0/3,633.4/4,568.1     1000.47
                    32     2,356.8/7,103.3/7,894.9     1231.11

With a small number of contending threads, the performance and
variability of the two types of unfair lock are similar. However, as
the number of contending threads increases, the byte lock shows a much
higher variability than the unfair qspinlock.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qspinlock.h | 17 ++--
kernel/locking/qspinlock.c | 98 ++++++++++++++-
kernel/locking/qspinlock_unfair.h | 238 +++++++++++++++++++++++++++++++++++++
3 files changed, 340 insertions(+), 13 deletions(-)
create mode 100644 kernel/locking/qspinlock_unfair.h

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 9d51fae..bc82ace 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -39,18 +39,19 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
}
#endif

-#define virt_queued_spin_lock virt_queued_spin_lock
+#ifdef CONFIG_HYPERVISOR_GUEST
+#ifndef static_cpu_has_hypervisor
+#define static_cpu_has_hypervisor static_cpu_has(X86_FEATURE_HYPERVISOR)
+#endif

-static inline bool virt_queued_spin_lock(struct qspinlock *lock)
+#define queued_spin_trylock_unfair queued_spin_trylock_unfair
+static inline bool queued_spin_trylock_unfair(struct qspinlock *lock)
{
- if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
- return false;
-
- while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
- cpu_relax();
+ u8 *l = (u8 *)lock;

- return true;
+ return !READ_ONCE(*l) && (xchg(l, _Q_LOCKED_VAL) == 0);
}
+#endif

#include <asm-generic/qspinlock.h>

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5a25e89..65dead9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -19,7 +19,11 @@
* Peter Zijlstra <[email protected]>
*/

-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if defined(_GEN_PV_LOCK_SLOWPATH) || defined(_GEN_UNFAIR_LOCK_SLOWPATH)
+#define _GEN_LOCK_SLOWPATH
+#endif
+
+#ifndef _GEN_LOCK_SLOWPATH

#include <linux/smp.h>
#include <linux/bug.h>
@@ -68,7 +72,7 @@

#include "mcs_spinlock.h"

-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#ifdef CONFIG_HYPERVISOR_GUEST
#define MAX_NODES 8
#else
#define MAX_NODES 4
@@ -81,6 +85,7 @@
* Exactly fits one 64-byte cacheline on a 64-bit architecture.
*
* PV doubles the storage and uses the second cacheline for PV state.
+ * Unfair lock (mutually exclusive to PV) also uses the second cacheline.
*/
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);

@@ -277,7 +282,18 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock,
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
#endif

-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#ifdef CONFIG_HYPERVISOR_GUEST
+static void unfair_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+#else
+static __always_inline void
+unfair_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) { }
+#endif
+
+#ifndef static_cpu_has_hypervisor
+#define static_cpu_has_hypervisor false
+#endif
+
+#endif /* !_GEN_LOCK_SLOWPATH */

/**
* queued_spin_lock_slowpath - acquire the queued spinlock
@@ -314,8 +330,13 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
goto queue;
}

- if (virt_queued_spin_lock(lock))
+ /*
+ * Use unfair lock for VM if PV spinlock is not enabled
+ */
+ if (static_cpu_has_hypervisor) {
+ unfair_queued_spin_lock_slowpath(lock, val);
return;
+ }

/*
* wait for in-progress pending->locked hand-overs
@@ -478,7 +499,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
/*
* Generate the paravirt code for queued_spin_unlock_slowpath().
*/
-#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#if !defined(_GEN_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
#define _GEN_PV_LOCK_SLOWPATH

#undef pv_enabled
@@ -496,4 +517,71 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#include "qspinlock_paravirt.h"
#include "qspinlock.c"

+#undef _GEN_PV_LOCK_SLOWPATH
+
+#endif
+
+/*
+ * Generate the unfair lock code for queued_spin_unlock_slowpath().
+ */
+#if defined(CONFIG_HYPERVISOR_GUEST) && !defined(_GEN_LOCK_SLOWPATH)
+#define _GEN_UNFAIR_LOCK_SLOWPATH
+
+#undef pv_enabled
+#define pv_enabled() true
+
+#ifdef queued_spin_lock_slowpath
+#undef queued_spin_lock_slowpath
+#endif
+#define queued_spin_lock_slowpath unfair_queued_spin_lock_slowpath
+
+#ifdef queued_spin_trylock
+#undef queued_spin_trylock
+#endif
+#define queued_spin_trylock queued_spin_trylock_unfair
+
+/*
+ * The unfair lock code is used internally and so don't need to be exported
+ */
+#undef EXPORT_SYMBOL
+#define EXPORT_SYMBOL(x)
+
+#ifdef pv_init_node
+#undef pv_init_node
+#endif
+#ifdef pv_wait_node
+#undef pv_wait_node
+#endif
+#ifdef pv_wait_head
+#undef pv_wait_head
+#endif
+
+#ifndef pv_scan_next
+#define pv_scan_next __pv_scan_next
+#endif
+#ifndef pv_pending_lock
+#define pv_pending_lock(l, v) false
+#endif
+
+#define pv_init_node unfair_init_node
+#define pv_wait_node(node, prev) \
+ do { \
+ if (unfair_wait_node(lock, node, prev, tail, old)) \
+ goto release; \
+ } while (0)
+#define pv_wait_head(lock, node) \
+ do { \
+ if (unfair_wait_head(lock, node, tail)) \
+ goto release; \
+ } while (0)
+
+#include "qspinlock_unfair.h"
+#include "qspinlock.c"
+
+#undef _GEN_UNFAIR_LOCK_SLOWPATH
+
+#endif
+
+#ifdef _GEN_LOCK_SLOWPATH
+#undef _GEN_LOCK_SLOWPATH
#endif
diff --git a/kernel/locking/qspinlock_unfair.h b/kernel/locking/qspinlock_unfair.h
new file mode 100644
index 0000000..0e8a40f
--- /dev/null
+++ b/kernel/locking/qspinlock_unfair.h
@@ -0,0 +1,238 @@
+#ifndef _GEN_UNFAIR_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+/*
+ * Unfair lock support in a virtualized guest
+ *
+ * An unfair lock can be implemented using a simple test-and-set lock like
+ * what is being done in a read-write lock. This simple scheme has 2 major
+ * problems:
+ * 1) It needs constant reading and occasionally writing to the lock word
+ * thus putting a lot of cacheline contention traffic on the affected
+ * cacheline.
+ * 2) Lock starvation is a real possibility especially if the number of
+ * virtual CPUs is large.
+ *
+ * To reduce the undesirable side effects of an unfair lock, the queue
+ * unfair spinlock implements a more elaborate scheme.
+ *
+ * Even in the wait queue, the task can still attempt to steal the lock
+ * periodically at a frequency that decreases in relation to its distance
+ * from the queue head. In other words, the closer it is to the queue head,
+ * the higher the chance it has of stealing the lock. This scheme reduces
+ * the load on the lock cacheline while trying to maintain a somewhat FIFO
+ * order of getting the lock so as to reduce the chance of lock starvation.
+ *
+ * At less than the threshold distance from the queue head, the lock stealing
+ * period increase is exponential (*2). At the threshold and beyond, the
+ * increase is linear (+LPERIOD_INC).
+ */
+#define LPERIOD_MIN_SHIFT 2
+#define LPERIOD_THRESHOLD_SHIFT 8
+#define LPERIOD_MIN (1 << LPERIOD_MIN_SHIFT)
+#define LPERIOD_THRESHOLD (1 << LPERIOD_THRESHOLD_SHIFT)
+#define LPERIOD_QHEAD (LPERIOD_MIN >> 1)
+#define LPERIOD_INC 32 /* Linear increment beyond threshold */
+
+struct uf_node {
+ struct mcs_spinlock mcs;
+ struct mcs_spinlock __res[3];
+
+ struct uf_node *prev; /* Previous node address */
+ int lsteal_period; /* Lock stealing period */
+ u32 prev_tail; /* Previous node tail code */
+};
+
+/**
+ * cmpxchg_tail - Put in the new tail code if it matches the old one
+ * @lock : Pointer to queue spinlock structure
+ * @old : The old tail code value
+ * @new : The new tail code value
+ * Return: true if operation succeeds, false otherwise
+ *
+ * The caller must have grabbed the lock before calling it and pending bit
+ * isn't used.
+ */
+static __always_inline int
+cmpxchg_tail(struct qspinlock *lock, u32 old, u32 new)
+{
+ old |= _Q_LOCKED_VAL;
+ new |= _Q_LOCKED_VAL;
+ return atomic_cmpxchg(&lock->val, old, new) == old;
+}
+
+/**
+ * Initialize the unfair part of the mcs_spinlock node.
+ * @node: Pointer to mcs_spinlock structure
+ */
+static inline void unfair_init_node(struct mcs_spinlock *node)
+{
+ struct uf_node *pn = (struct uf_node *)node;
+
+ BUILD_BUG_ON(sizeof(struct uf_node) > 5*sizeof(struct mcs_spinlock));
+
+ pn->prev = NULL;
+ pn->prev_tail = 0;
+ pn->lsteal_period = LPERIOD_QHEAD;
+}
+
+/**
+ * Wait for node->locked to become true or the lock was stolen.
+ * @lock : Pointer to queue spinlock structure
+ * @node : Pointer to mcs_spinlock structure of current node
+ * @prev : Pointer to mcs_spinlock structure of previous node
+ * @my_tail : Tail code of current node
+ * @prev_tail: Tail code of previous node
+ *
+ * Return: true if lock stolen, false if becoming queue head
+ */
+static inline bool unfair_wait_node(struct qspinlock *lock,
+ struct mcs_spinlock *node,
+ struct mcs_spinlock *prev,
+ u32 my_tail, u32 prev_tail)
+{
+ struct uf_node *next, *pn = (struct uf_node *)node;
+ int loop;
+ bool isqhead;
+
+ pn->prev = (struct uf_node *)prev;
+ pn->prev_tail = prev_tail;
+
+ for (;;) {
+ /*
+ * The spinning period of this node will be double that of the
+ * previous node if it is at less than the threshold distance
+ * from the queue head. Otherwise, it will spin a fixed
+ * additional amount more than the previous one.
+ */
+ u32 period = READ_ONCE(pn->prev->lsteal_period);
+
+ pn->lsteal_period = (period < LPERIOD_THRESHOLD)
+ ? period * 2 : period + LPERIOD_INC;
+ for (loop = pn->lsteal_period; loop; loop--) {
+ /*
+ * The caller will do a load-acquire for us.
+ */
+ if (READ_ONCE(node->locked))
+ return false;
+ cpu_relax();
+ }
+
+ /*
+ * Attempt to steal the lock
+ */
+ if (queued_spin_trylock_unfair(lock))
+ break; /* Got the lock */
+ }
+
+ /*
+ * Have stolen the lock, need to remove itself from the wait queue.
+ * There are 3 steps that need to be done:
+ * 1) If it is at the end of the queue, change the tail code in the
+ * lock to the one before it. If the current node has become the
+ * queue head, the previous tail code is 0.
+ * 2) Change the next pointer in the previous queue node to point
+ * to the next one in queue or NULL if it is at the end of queue.
+ * 3) If a next node is present, copy the prev_tail and prev values
+ * to the next node.
+ *
+ * Note that no forward progress is possible in other nodes as the
+ * lock has just been acquired.
+ */
+ isqhead = READ_ONCE(pn->mcs.locked);
+ prev_tail = isqhead ? 0 : pn->prev_tail;
+
+ if (isqhead)
+ pn->lsteal_period = LPERIOD_QHEAD;
+
+ /* Step 1 */
+ if ((atomic_read(&lock->val) == (my_tail | _Q_LOCKED_VAL)) &&
+ cmpxchg_tail(lock, my_tail, prev_tail)) {
+ /*
+ * Successfully change the tail code back to the
+ * previous one. Now need to clear the next pointer
+ * in the previous node only if it contains my queue
+ * node address and is not the queue head.
+ *
+ * The cmpxchg() call below may fail if a new task
+ * comes along and put its node address into the next
+ * pointer. Whether the operation succeeds or fails
+ * doesn't really matter.
+ */
+ /* Step 2 */
+ if (!isqhead)
+ (void)cmpxchg(&pn->prev->mcs.next, &pn->mcs, NULL);
+ pn->mcs.next = NULL;
+ return true;
+ }
+
+ /*
+ * Step 3
+ * A next node has to be present if the lock has a different
+ * tail code. So wait until the next pointer is set.
+ */
+ while (!(next = (struct uf_node *)READ_ONCE(node->next)))
+ cpu_relax();
+ if (isqhead) {
+ struct mcs_spinlock *nxt = (struct mcs_spinlock *)next;
+
+ arch_mcs_spin_unlock_contended(&nxt->locked);
+ return true; /* Done for queue head */
+ }
+
+ WRITE_ONCE(pn->prev->mcs.next, (struct mcs_spinlock *)next);
+
+ /*
+ * Need to make sure that both prev and prev_tail of the next node
+ * are set up before modifying them.
+ */
+ while (!(READ_ONCE(next->prev) && READ_ONCE(next->prev_tail)))
+ cpu_relax();
+ WRITE_ONCE(next->prev, pn->prev);
+ WRITE_ONCE(next->prev_tail, pn->prev_tail);
+
+ /*
+ * A smp_wmb() is inserted here to make sure all the new unfair node
+ * information in next will be visible before unlock which is not
+ * a full memory barrier.
+ *
+ * It matches the atomic instruction in queued_spin_lock_unfair().
+ */
+ smp_wmb();
+ return true;
+}
+
+/**
+ * As queue head, attempt to acquire the lock in a normal spin loop.
+ * @lock: Pointer to queue spinlock structure
+ * @node: Pointer to mcs_spinlock structure of current node
+ * @tail: Tail code of current node
+ *
+ * Return: true is always returned
+ */
+static inline bool
+unfair_wait_head(struct qspinlock *lock, struct mcs_spinlock *node, u32 tail)
+{
+ struct uf_node *pn = (struct uf_node *)node;
+ struct mcs_spinlock *next;
+
+ pn->lsteal_period = LPERIOD_QHEAD;
+ while (!queued_spin_trylock_unfair(lock))
+ cpu_relax();
+
+ /*
+ * Remove tail code in the lock if it is the only one in the queue
+ */
+ if ((atomic_read(&lock->val) == (tail | _Q_LOCKED_VAL)) &&
+ cmpxchg_tail(lock, tail, 0))
+ goto release;
+
+ while (!(next = READ_ONCE(node->next)))
+ cpu_relax();
+
+ arch_mcs_spin_unlock_contended(&next->locked);
+
+release:
+ return true;
+}
--
1.7.1

2015-07-11 20:37:36

by Waiman Long

Subject: [PATCH 7/7] locking/qspinlock: Collect queued unfair lock slowpath statistics

This patch enables the accumulation of unfair qspinlock statistics
when the CONFIG_QUEUED_LOCK_STAT configuration parameter is set.

The accumulated lock statistics will be reported in debugfs under
the unfair-qspinlock directory.

On a KVM guest with 32 vCPUs, the statistics counts after bootup were:

lsteal_cnts = 172219 2377 425 118 33 8 5 12 14 0 0 0
trylock_cnt = 1495372

So most of the lock stealing happened in the initial trylock before
entering the queue. Once a vCPU is in the queue, its chance of getting
the lock drops off significantly the further it is from the queue head.
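
For reference, here is a user-space sketch (not part of the patch) of
how a lock-stealing period maps to a bucket in lsteal_cnts[], mirroring
ustat_inc() below; fls() is re-implemented since it is a kernel helper:

#include <stdio.h>

#define LPERIOD_MIN_SHIFT	2
#define NR_LPERIOD_CNTS		(8 - 2 + 6)	/* 12 buckets, as in the output above */

static int fls(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

static int bucket(int lsteal_period)
{
	int idx = fls(lsteal_period) - LPERIOD_MIN_SHIFT;

	return idx >= NR_LPERIOD_CNTS ? NR_LPERIOD_CNTS - 1 : idx;
}

int main(void)
{
	/* bucket 0 is the queue head (period 2), then one bucket per doubling */
	printf("%d %d %d %d\n", bucket(2), bucket(4), bucket(256), bucket(512));
	return 0;	/* prints 0 1 7 8 */
}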

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 7 ++-
kernel/locking/qspinlock.c | 2 +-
kernel/locking/qspinlock_unfair.h | 89 +++++++++++++++++++++++++++++++++++++
3 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 299a1c4..aee6236 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -680,11 +680,12 @@ config PARAVIRT_SPINLOCKS
If you are unsure how to answer this question, answer Y.

config QUEUED_LOCK_STAT
- bool "Paravirt queued lock statistics"
- depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ bool "Paravirt/Unfair queued lock statistics"
+ depends on DEBUG_FS && QUEUED_SPINLOCKS
---help---
Enable the collection of statistical data on the behavior of
- paravirtualized queued spinlocks and report them on debugfs.
+ paravirtualized and unfair queued spinlocks and report them
+ on debugfs.

source "arch/x86/xen/Kconfig"

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 65dead9..12e2e89 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -538,7 +538,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#ifdef queued_spin_trylock
#undef queued_spin_trylock
#endif
-#define queued_spin_trylock queued_spin_trylock_unfair
+#define queued_spin_trylock __queued_spin_trylock_unfair

/*
* The unfair lock code is used internally and so don't need to be exported
diff --git a/kernel/locking/qspinlock_unfair.h b/kernel/locking/qspinlock_unfair.h
index 0e8a40f..fc94578 100644
--- a/kernel/locking/qspinlock_unfair.h
+++ b/kernel/locking/qspinlock_unfair.h
@@ -44,6 +44,93 @@ struct uf_node {
u32 prev_tail; /* Previous node tail code */
};

+#ifdef CONFIG_QUEUED_LOCK_STAT
+
+#include <linux/debugfs.h>
+
+/*
+ * Unfair qspinlock statistics
+ *
+ * All spinning CPUs are grouped into buckets depending on the most
+ * significant bit in their lock stealing period. The first entry in
+ * the array is for the queue head.
+ */
+#define NR_LPERIOD_CNTS (LPERIOD_THRESHOLD_SHIFT - LPERIOD_MIN_SHIFT + 6)
+static atomic_t lsteal_cnts[NR_LPERIOD_CNTS];
+
+/*
+ * # of successful trylocks at beginning of slowpath
+ */
+static atomic_t trylock_cnt;
+
+/*
+ * Counts reset flag
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the unfair qspinlock statistics
+ */
+static int __init unfair_qspinlock_debugfs(void)
+{
+ struct dentry *d_ufqlock = debugfs_create_dir("unfair-qspinlock", NULL);
+
+ if (!d_ufqlock)
+ printk(KERN_WARNING
+ "Could not create 'unfair-qspinlock' debugfs directory\n");
+
+ debugfs_create_u32_array("lsteal_cnts", 0444, d_ufqlock,
+ (u32 *)lsteal_cnts, NR_LPERIOD_CNTS);
+ debugfs_create_u32("trylock_cnt", 0444, d_ufqlock, (u32 *)&trylock_cnt);
+ debugfs_create_bool("reset_cnts", 0644, d_ufqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(unfair_qspinlock_debugfs);
+
+/*
+ * Reset all the statistics counts
+ */
+static noinline void reset_counts(void)
+{
+ int idx;
+
+ reset_cnts = false;
+ atomic_set(&trylock_cnt, 0);
+ for (idx = 0 ; idx < NR_LPERIOD_CNTS; idx++)
+ atomic_set(&lsteal_cnts[idx], 0);
+}
+
+/*
+ * Increment the unfair qspinlock statistic count
+ */
+static inline void ustat_inc(struct uf_node *pn)
+{
+ /*
+ * fls() returns the most significant 1 bit position + 1
+ */
+ int idx = fls(pn->lsteal_period) - LPERIOD_MIN_SHIFT;
+
+ if (idx >= NR_LPERIOD_CNTS)
+ idx = NR_LPERIOD_CNTS - 1;
+ atomic_inc(&lsteal_cnts[idx]);
+ if (unlikely(reset_cnts))
+ reset_counts();
+}
+
+static inline bool __queued_spin_trylock_unfair(struct qspinlock *lock)
+{
+ bool ret = queued_spin_trylock_unfair(lock);
+
+ if (ret)
+ atomic_inc(&trylock_cnt);
+ return ret;
+}
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+static inline void ustat_inc(struct uf_node *pn) { }
+#define __queued_spin_trylock_unfair queued_spin_trylock_unfair
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
/**
* cmpxchg_tail - Put in the new tail code if it matches the old one
* @lock : Pointer to queue spinlock structure
@@ -125,6 +212,7 @@ static inline bool unfair_wait_node(struct qspinlock *lock,
if (queued_spin_trylock_unfair(lock))
break; /* Got the lock */
}
+ ustat_inc(pn);

/*
* Have stolen the lock, need to remove itself from the wait queue.
@@ -220,6 +308,7 @@ unfair_wait_head(struct qspinlock *lock, struct mcs_spinlock *node, u32 tail)
pn->lsteal_period = LPERIOD_QHEAD;
while (!queued_spin_trylock_unfair(lock))
cpu_relax();
+ ustat_inc(pn);

/*
* Remove tail code in the lock if it is the only one in the queue
--
1.7.1

2015-07-12 08:22:17

by Peter Zijlstra

Subject: Re: [PATCH 6/7] locking/qspinlock: A fairer queued unfair lock

On Sat, Jul 11, 2015 at 04:36:57PM -0400, Waiman Long wrote:
> For a virtual guest with the qspinlock patch, a simple unfair byte lock
> will be used if PV spinlock is not configured in or the hypervisor
> isn't either KVM or Xen.

Why do we care about this case enough to add over 300 lines of code?

2015-07-12 08:22:17

by Peter Zijlstra

Subject: Re: [PATCH 5/7] locking/pvqspinlock: Add pending bit support

On Sat, Jul 11, 2015 at 04:36:56PM -0400, Waiman Long wrote:
> Like the native qspinlock, using the pending bit when it is lightly
> loaded to acquire the lock is faster than going through the PV queuing
> process which is even slower than the native queuing process. It also
> avoids loading two additional cachelines (the MCS and PV nodes).
>
> This patch adds the pending bit support for PV qspinlock. The
> pending bit code has a smaller spin threshold. It will default back
> to the queuing method if it cannot acquired the lock within a certain
> time limit.

-ENONUMBERS

2015-07-12 08:22:42

by Peter Zijlstra

Subject: Re: [PATCH 4/7] locking/pvqspinlock: Collect slowpath lock statistics

On Sat, Jul 11, 2015 at 04:36:55PM -0400, Waiman Long wrote:
> This patch enables the accumulation of kicking and waiting related
> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
> option is selected. It also enables the collection of kicking and
> wakeup latencies which have a heavy dependency on the CPUs being used.
>
> The measured latencies for different CPUs are:
>
> CPU Wakeup Kicking
> --- ------ -------
> Haswell-EX 26.4us 9.2us
> Westmere-EX 99.4US 25.5us
>
> So Haswell is much faster than Westmere.
>
> The accumulated lock statistics will be reported in debugfs under the
> pv-qspinlock directory.

You've again stuck this behind non-trivial patches :-(

2015-07-12 08:23:13

by Peter Zijlstra

Subject: Re: [PATCH 3/7] locking/pvqspinlock: Implement wait-early for overcommitted guest

On Sat, Jul 11, 2015 at 04:36:54PM -0400, Waiman Long wrote:
> In an overcommitted guest where some vCPUs have to be halted to make
> forward progress in other areas, it is highly likely that a vCPU
> later in the spinlock queue will be spinning while the ones earlier in
> the queue would have been halted already. The spinning in the later
> vCPUs is then just a waste of precious CPU cycles because they are
> not going to get the lock soon as the earlier ones have to be woken
> up and take their turn to get the lock.
>
> This patch implements a wait-early mechanism where the vCPU will
> call pv_wait() earlier if the previous vCPU is in the halted state
> already. In this case, it will spin less before calling pv_wait(). On
> the other hand, if the previous vCPU was running and then becomes
> halted, the current vCPU will call pv_wait() immediately in this case.
>
> This patch also separates the spin threshold for queue head and
> queue nodes. It favors the queue head by allowing it to spin longer
> before calling pv_wait().

-ENONUMBERS
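
A rough userspace sketch of that wait-early decision, with illustrative
names (the node layout, QNODE_SPIN_THRESHOLD and the return convention
are assumptions, not the patch's interface):

#include <stdatomic.h>
#include <stdbool.h>

enum vcpu_state { vcpu_running = 0, vcpu_halted };

struct pv_node_sketch {
	_Atomic int state;	/* vcpu_running or vcpu_halted */
	_Atomic int locked;	/* MCS handoff flag, set by our predecessor */
};

#define QNODE_SPIN_THRESHOLD	(1 << 10)	/* illustrative value only */

/* Spin for the MCS handoff; return true if it arrived, false if the
 * caller should halt (pv_wait) now.  A halted predecessor means we
 * cannot get the lock until it is woken, so further spinning is wasted. */
static bool spin_or_wait_early(struct pv_node_sketch *me,
			       struct pv_node_sketch *prev)
{
	int loop;

	for (loop = QNODE_SPIN_THRESHOLD; loop; loop--) {
		if (atomic_load_explicit(&me->locked, memory_order_acquire))
			return true;
		if (atomic_load_explicit(&prev->state,
					 memory_order_relaxed) == vcpu_halted)
			return false;	/* wait-early: halt immediately */
	}
	return false;			/* normal spin budget exhausted */
}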

2015-07-13 12:02:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

On Sat, Jul 11, 2015 at 04:36:52PM -0400, Waiman Long wrote:
> @@ -181,9 +187,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
> pv_wait(&pn->state, vcpu_halted);
>
> /*
> - * Reset the vCPU state to avoid unncessary CPU kicking
> + * Reset the state except when vcpu_hashed is set.
> */
> - WRITE_ONCE(pn->state, vcpu_running);
> + cmpxchg(&pn->state, vcpu_halted, vcpu_running);

Why? Suppose we did get advanced into the hashed state, and then get a
(spurious) wakeup, this means we'll observe our ->locked == 1 condition
and fall out of pv_wait_node().

We'll then enter pv_wait_head(), which with your modification:

> @@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> {
> struct pv_node *pn = (struct pv_node *)node;
> struct __qspinlock *l = (void *)lock;
> - struct qspinlock **lp = NULL;
> + struct qspinlock **lp;
> int loop;
>
> + /*
> + * Initialize lp to a non-NULL value if it has already been in the
> + * pv_hashed state so that pv_hash() won't be called again.
> + */
> + lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
> + : NULL;
> for (;;) {
> + WRITE_ONCE(pn->state, vcpu_running);

Will instantly and unconditionally write vcpu_running.

2015-07-13 12:31:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

On Mon, Jul 13, 2015 at 02:02:02PM +0200, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:52PM -0400, Waiman Long wrote:
> > @@ -181,9 +187,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
> > pv_wait(&pn->state, vcpu_halted);
> >
> > /*
> > - * Reset the vCPU state to avoid unncessary CPU kicking
> > + * Reset the state except when vcpu_hashed is set.
> > */
> > - WRITE_ONCE(pn->state, vcpu_running);
> > + cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>
> Why? Suppose we did get advanced into the hashed state, and then get a
> (spurious) wakeup, this means we'll observe our ->locked == 1 condition
> and fall out of pv_wait_node().
>
> We'll then enter pv_wait_head(), which with your modification:
>
> > @@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> > {
> > struct pv_node *pn = (struct pv_node *)node;
> > struct __qspinlock *l = (void *)lock;
> > - struct qspinlock **lp = NULL;
> > + struct qspinlock **lp;
> > int loop;
> >
> > + /*
> > + * Initialize lp to a non-NULL value if it has already been in the
> > + * pv_hashed state so that pv_hash() won't be called again.
> > + */
> > + lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
> > + : NULL;

Because that ^

> > for (;;) {
> > + WRITE_ONCE(pn->state, vcpu_running);
>
> Will instantly and unconditionally write vcpu_running.
>
>

2015-07-13 13:48:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

On Sat, Jul 11, 2015 at 04:36:52PM -0400, Waiman Long wrote:
> @@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> {
> struct pv_node *pn = (struct pv_node *)node;
> struct __qspinlock *l = (void *)lock;
> - struct qspinlock **lp = NULL;
> + struct qspinlock **lp;
> int loop;
>
> + /*
> + * Initialize lp to a non-NULL value if it has already been in the
> + * pv_hashed state so that pv_hash() won't be called again.
> + */
> + lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
> + : NULL;
> for (;;) {
> + WRITE_ONCE(pn->state, vcpu_running);
> for (loop = SPIN_THRESHOLD; loop; loop--) {
> if (!READ_ONCE(l->locked))
> return;
> cpu_relax();
> }
>
> - WRITE_ONCE(pn->state, vcpu_halted);
> + /*
> + * Recheck lock value after setting vcpu_hashed state
> + *
> + * [S] state = vcpu_hashed [S] l->locked = 0
> + * MB MB
> + * [L] l->locked [L] state == vcpu_hashed
> + *
> + * Matches smp_store_mb() in __pv_queued_spin_unlock()
> + */
> + smp_store_mb(pn->state, vcpu_hashed);
> +
> + if (!READ_ONCE(l->locked)) {
> + WRITE_ONCE(pn->state, vcpu_running);
> + return;
> + }
> +
> if (!lp) { /* ONCE */
> lp = pv_hash(lock, pn);
> +
> /*
> * lp must be set before setting _Q_SLOW_VAL
> *
> @@ -305,13 +343,16 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
> * Now that we have a reference to the (likely) blocked pv_node,
> * release the lock.
> */
> - smp_store_release(&l->locked, 0);
> + smp_store_mb(l->locked, 0);
>
> /*
> * At this point the memory pointed at by lock can be freed/reused,
> * however we can still use the pv_node to kick the CPU.
> + * The other vCPU may not really be halted, but kicking an active
> + * vCPU is harmless other than the additional latency in completing
> + * the unlock.
> */
> - if (READ_ONCE(node->state) == vcpu_halted)
> + if (READ_ONCE(node->state) == vcpu_hashed)
> pv_kick(node->cpu);
> }

I think most of that is not actually required; if we let pv_kick_node()
set vcpu_hashed and avoid writing another value in pv_wait_head(), then
__pv_queued_spin_unlock() has two cases:

- pv_kick_node() set _SLOW_VAL, which is the same 'thread' and things
observe program order and we're trivially guaranteed to see
node->state and the hash state.

- pv_wait_head() set it, and because it does the @lp branch the
existing barrier from our cmpxchg() there setting _SLOW_VAL matches
the cmpxchg() in __pv_queued_spin_unlock() observing _SLOW_VAL
and we still observe the right node and hash values.

Giving something like the patch below.

On the s/pv_scan_node/pv_kick_node/ rename: conceptually we still make
the waiter go; the fact that we don't wake it is an implementation
detail.

---
Subject: locking/pvqspinlock: Only kick CPU at unlock time
From: Waiman Long <[email protected]>
Date: Sat, 11 Jul 2015 16:36:52 -0400

For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes queue head and once again
before it gets the lock. All these CPU kicking and halting (VMEXIT)
can be expensive and slow down system performance.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only to differentiate other
queue nodes from the queue head.

Cc: Scott J Norton <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
[peterz: s/pv_scan_node/pv_kick_node/, simplification of pv_wait_head()]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/locking/qspinlock.c | 6 +--
kernel/locking/qspinlock_paravirt.h | 72 +++++++++++++++++++++++++++---------
2 files changed, 57 insertions(+), 21 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(s

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

@@ -440,7 +440,7 @@ void queued_spin_lock_slowpath(struct qs
cpu_relax();

arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_kick_node(lock, next);

release:
/*
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -21,9 +21,14 @@

#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

+/*
+ * Queue node uses: vcpu_running & vcpu_halted.
+ * Queue head uses: vcpu_running & vcpu_hashed.
+ */
enum vcpu_state {
vcpu_running = 0,
- vcpu_halted,
+ vcpu_halted, /* Used only in pv_wait_node */
+ vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
};

struct pv_node {
@@ -152,7 +157,8 @@ static void pv_init_node(struct mcs_spin

/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
*/
static void pv_wait_node(struct mcs_spinlock *node)
{
@@ -171,9 +177,9 @@ static void pv_wait_node(struct mcs_spin
*
* [S] pn->state = vcpu_halted [S] next->locked = 1
* MB MB
- * [L] pn->locked [RmW] pn->state = vcpu_running
+ * [L] pn->locked [RmW] pn->state = vcpu_hashed
*
- * Matches the xchg() from pv_kick_node().
+ * Matches the cmpxchg() from pv_kick_node().
*/
smp_store_mb(pn->state, vcpu_halted);

@@ -181,9 +187,10 @@ static void pv_wait_node(struct mcs_spin
pv_wait(&pn->state, vcpu_halted);

/*
- * Reset the vCPU state to avoid unncessary CPU kicking
+ * If pv_kick_node() changed us to vcpu_hashed, retain that value
+ * so that pv_wait_head() knows to not also try to hash this lock.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ cmpxchg(&pn->state, vcpu_halted, vcpu_running);

/*
* If the locked flag is still not set after wakeup, it is a
@@ -193,6 +200,7 @@ static void pv_wait_node(struct mcs_spin
* MCS lock will be released soon.
*/
}
+
/*
* By now our node->locked should be 1 and our caller will not actually
* spin-wait for it. We do however rely on our caller to do a
@@ -201,24 +209,35 @@ static void pv_wait_node(struct mcs_spin
}

/*
- * Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * Called after setting next->locked = 1 when we're the lock owner.
+ *
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state such
+ * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct __qspinlock *l = (void *)lock;

/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
+ * If the vCPU is indeed halted, advance its state to match that of
+ * pv_wait_node(). If OTOH this fails, the vCPU was running and will
+ * observe its next->locked value and advance itself.
*
- * This should be fine however, kicking people for no reason is
- * harmless.
+ * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+ */
+ if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ return;
+
+ /*
+ * Put the lock into the hash table and set the _Q_SLOW_VAL.
*
- * See the comment in pv_wait_node().
+ * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+ * the hash table later on at unlock time, no atomic instruction is
+ * needed.
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
- pv_kick(pn->cpu);
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
}

/*
@@ -232,6 +251,13 @@ static void pv_wait_head(struct qspinloc
struct qspinlock **lp = NULL;
int loop;

+ /*
+ * If pv_kick_node() already advanced our state, we don't need to
+ * insert ourselves into the hash table anymore.
+ */
+ if (READ_ONCE(pn->state) == vcpu_hashed)
+ lp = (struct qspinlock **)1;
+
for (;;) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -239,9 +265,16 @@ static void pv_wait_head(struct qspinloc
cpu_relax();
}

- WRITE_ONCE(pn->state, vcpu_halted);
+ /*
+ * Either pv_kick_node() advanced us and ->state is already
+ * vcpu_hashed and this store is superfluous, or this is the
+ * first in which case the below cmpxchg() provides the
+ * required barriers.
+ */
+ WRITE_ONCE(pn->state, vcpu_hashed);
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
+
/*
* lp must be set before setting _Q_SLOW_VAL
*
@@ -310,8 +343,11 @@ __visible void __pv_queued_spin_unlock(s
/*
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
+ * The other vCPU may not really be halted, but kicking an active
+ * vCPU is harmless other than the additional latency in completing
+ * the unlock.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
+ if (READ_ONCE(node->state) == vcpu_hashed)
pv_kick(node->cpu);
}
/*

2015-07-13 13:52:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead

On Sat, Jul 11, 2015 at 04:36:53PM -0400, Waiman Long wrote:
> Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
> critical section and block forward progress. This patch implements
> a kick-ahead mechanism where the unlocker will kick the queue head
> vCPUs as well as up to two additional vCPUs next to the queue head if
> they were halted. The kickings are done after exiting the critical
> section to improve parallelism.
>
> The amount of kick-ahead allowed depends on the number of vCPUs in
> the VM guest. This change should improve overall system performance
> in a busy overcommitted guest.

-ENONUMBERS... also highly workload sensitive, if the lock hold time is
just above our spin time you're wasting gobs of runtime.
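
Roughly, the kick-ahead idea amounts to the following, run after the
lock word has been released (a sketch with illustrative names;
PV_KICK_AHEAD, pv_kick_stub() and the node layout are assumptions, and
the patch scales the kick-ahead count with the number of vCPUs):

#include <stdatomic.h>
#include <stddef.h>

enum vcpu_state { vcpu_running = 0, vcpu_halted };

struct pv_node_sketch {
	struct pv_node_sketch *next;
	_Atomic int state;
	int cpu;
};

#define PV_KICK_AHEAD	2	/* extra vCPUs kicked beyond the queue head */

static void pv_kick_stub(int cpu)
{
	(void)cpu;		/* would be a wake-up hypercall in a guest */
}

/* Called by the unlocker after releasing the lock: kick the queue head
 * plus up to PV_KICK_AHEAD of its successors if they are halted, so
 * their wake-up latency overlaps with the next holders' critical sections. */
static void unlock_kick_ahead(struct pv_node_sketch *head)
{
	struct pv_node_sketch *n = head;
	int i;

	for (i = 0; n && i <= PV_KICK_AHEAD; i++, n = n->next) {
		if (atomic_load_explicit(&n->state,
					 memory_order_acquire) == vcpu_halted)
			pv_kick_stub(n->cpu);
	}
}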

2015-07-13 19:50:31

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 3/7] locking/pvqspinlock: Implement wait-early for overcommitted guest

On Sat, 2015-07-11 at 16:36 -0400, Waiman Long wrote:
> /*
> + * Queued Spinlock Spin Thresholds
> + * -------------------------------
> + * Because of the cacheline contention effect of the ticket spinlock, the
> + * same spin threshold for queued spinlock will run a bit faster. So we set
> + * a slight larger threshold for the queue head (1.25X) while the other queue
> + * nodes will keep the same threshold.

How did you come to 1.25x? These sort of magic numbers scare me.
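
Concretely, the split being questioned amounts to something like this
(illustrative macro names; 1.25x is exactly the factor under discussion,
and SPIN_THRESHOLD is the existing (1<<15) value mentioned later in the
thread):

#define SPIN_THRESHOLD		(1 << 15)	/* existing spin count */

/* Queue head spins 25% longer than ordinary queue nodes. */
#define QHEAD_SPIN_THRESHOLD	(SPIN_THRESHOLD + (SPIN_THRESHOLD >> 2))
#define QNODE_SPIN_THRESHOLD	SPIN_THRESHOLD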

2015-07-14 09:31:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

On Mon, Jul 13, 2015 at 03:48:28PM +0200, Peter Zijlstra wrote:
> @@ -239,9 +265,16 @@ static void pv_wait_head(struct qspinloc
> cpu_relax();
> }
>
> - WRITE_ONCE(pn->state, vcpu_halted);
> + /*
> + * Either pv_kick_node() advanced us and ->state is already
> + * vcpu_hashed and this store is superfluous, or this is the
> + * first in which case the below cmpxchg() provides the
> + * required barriers.
> + */
> + WRITE_ONCE(pn->state, vcpu_hashed);

The easier option is of course to just move this store into the branch
below. That immediately clarifies the logic and avoids the superfluous
store.

> if (!lp) { /* ONCE */
> lp = pv_hash(lock, pn);
> +
> /*
> * lp must be set before setting _Q_SLOW_VAL
> *

2015-07-14 18:47:23

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 6/7] locking/qspinlock: A fairer queued unfair lock

On 07/12/2015 04:21 AM, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:57PM -0400, Waiman Long wrote:
>> For a virtual guest with the qspinlock patch, a simple unfair byte lock
>> will be used if PV spinlock is not configured in or the hypervisor
>> isn't either KVM or Xen.
> Why do we care about this case enough to add over 300 lines of code?

From my testing, I found the queued unfair lock to be superior to both
the byte lock and the PV qspinlock when the VM is overcommitted. My
current opinion is to use PV qspinlock for VMs that are not likely to
run into the overcommit problem. For other VMs that are overcommitted,
it will be better to use the queued unfair lock. However, this is a
choice that the system administrators have to make. That is also the
reason why I sent out another patch to add a KVM command line option to
disable PV spinlock, similar to what Xen already has. In this way,
depending on how the kernel is booted, we can choose either PV qspinlock
or queued unfair lock.

Cheers,
Longman

2015-07-14 18:48:00

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 5/7] locking/pvqspinlock: Add pending bit support

On 07/12/2015 04:21 AM, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:56PM -0400, Waiman Long wrote:
>> Like the native qspinlock, using the pending bit when it is lightly
>> loaded to acquire the lock is faster than going through the PV queuing
>> process which is even slower than the native queuing process. It also
>> avoids loading two additional cachelines (the MCS and PV nodes).
>>
>> This patch adds the pending bit support for PV qspinlock. The
>> pending bit code has a smaller spin threshold. It will default back
>> to the queuing method if it cannot acquire the lock within a certain
>> time limit.
> -ENONUMBERS

Will send out an updated patchset with patch-to-patch performance numbers.

Cheers,
Longman

2015-07-14 18:48:38

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 4/7] locking/pvqspinlock: Collect slowpath lock statistics

On 07/12/2015 04:22 AM, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:55PM -0400, Waiman Long wrote:
>> This patch enables the accumulation of kicking and waiting related
>> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
>> option is selected. It also enables the collection of kicking and
>> wakeup latencies which have a heavy dependency on the CPUs being used.
>>
>> The measured latencies for different CPUs are:
>>
>> CPU Wakeup Kicking
>> --- ------ -------
>> Haswell-EX 26.4us 9.2us
>> Westmere-EX 99.4us 25.5us
>>
>> So Haswell is much faster than Westmere.
>>
>> The accumulated lock statistics will be reported in debugfs under the
>> pv-qspinlock directory.
> You've again stuck this behind non-trivial patches :-(

I will move that patch to the front of the new patchset.

Cheers,
Longman

2015-07-14 20:45:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/7] locking/qspinlock: A fairer queued unfair lock

On Tue, Jul 14, 2015 at 02:47:16PM -0400, Waiman Long wrote:
> On 07/12/2015 04:21 AM, Peter Zijlstra wrote:
> >On Sat, Jul 11, 2015 at 04:36:57PM -0400, Waiman Long wrote:
> >>For a virtual guest with the qspinlock patch, a simple unfair byte lock
> >>will be used if PV spinlock is not configured in or the hypervisor
> >>isn't either KVM or Xen.
> >Why do we care about this case enough to add over 300 lines of code?
>
> From my testing, I found the queued unfair lock to be superior to both the
> byte lock and the PV qspinlock when the VM is overcommitted. My current
> opinion is to use PV qspinlock for VMs that are not likely to run into the
> overcommit problem. For other VMs that are overcommitted, it will be
> better to use the queued unfair lock. However, this is a choice that the
> system administrators have to make. That is also the reason why I sent out
> another patch to add a KVM command line option to disable PV spinlock,
> similar to what Xen already has. In this way, depending on how the kernel
> is booted, we can choose either PV qspinlock or queued unfair lock.

No, we're not going to add another 300 line lock implementation and a
knob.

2015-07-15 01:24:19

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

On 07/13/2015 08:02 AM, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:52PM -0400, Waiman Long wrote:
>> @@ -181,9 +187,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
>> pv_wait(&pn->state, vcpu_halted);
>>
>> /*
>> - * Reset the vCPU state to avoid unncessary CPU kicking
>> + * Reset the state except when vcpu_hashed is set.
>> */
>> - WRITE_ONCE(pn->state, vcpu_running);
>> + cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> Why? Suppose we did get advanced into the hashed state, and then get a
> (spurious) wakeup, this means we'll observe our ->locked == 1 condition
> and fall out of pv_wait_node().
>
> We'll then enter pv_wait_head(), which with your modification:
>
>> @@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> {
>> struct pv_node *pn = (struct pv_node *)node;
>> struct __qspinlock *l = (void *)lock;
>> - struct qspinlock **lp = NULL;
>> + struct qspinlock **lp;
>> int loop;
>>
>> + /*
>> + * Initialize lp to a non-NULL value if it has already been in the
>> + * pv_hashed state so that pv_hash() won't be called again.
>> + */
>> + lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
>> + : NULL;
>> for (;;) {
>> + WRITE_ONCE(pn->state, vcpu_running);
> Will instantly and unconditionally write vcpu_running.
>
>

This code is kind of complicated. I am going to get rid of the current
tri-state setup, and switch to a separate sync variable for deferred kicking.

Cheers,
Longman

2015-07-15 01:31:29

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

On 07/13/2015 09:48 AM, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:52PM -0400, Waiman Long wrote:
>> @@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> {
>> struct pv_node *pn = (struct pv_node *)node;
>> struct __qspinlock *l = (void *)lock;
>> - struct qspinlock **lp = NULL;
>> + struct qspinlock **lp;
>> int loop;
>>
>> + /*
>> + * Initialize lp to a non-NULL value if it has already been in the
>> + * pv_hashed state so that pv_hash() won't be called again.
>> + */
>> + lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
>> + : NULL;
>> for (;;) {
>> + WRITE_ONCE(pn->state, vcpu_running);
>> for (loop = SPIN_THRESHOLD; loop; loop--) {
>> if (!READ_ONCE(l->locked))
>> return;
>> cpu_relax();
>> }
>>
>> - WRITE_ONCE(pn->state, vcpu_halted);
>> + /*
>> + * Recheck lock value after setting vcpu_hashed state
>> + *
>> + * [S] state = vcpu_hashed [S] l->locked = 0
>> + * MB MB
>> + * [L] l->locked [L] state == vcpu_hashed
>> + *
>> + * Matches smp_store_mb() in __pv_queued_spin_unlock()
>> + */
>> + smp_store_mb(pn->state, vcpu_hashed);
>> +
>> + if (!READ_ONCE(l->locked)) {
>> + WRITE_ONCE(pn->state, vcpu_running);
>> + return;
>> + }
>> +
>> if (!lp) { /* ONCE */
>> lp = pv_hash(lock, pn);
>> +
>> /*
>> * lp must be set before setting _Q_SLOW_VAL
>> *
>> @@ -305,13 +343,16 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
>> * Now that we have a reference to the (likely) blocked pv_node,
>> * release the lock.
>> */
>> - smp_store_release(&l->locked, 0);
>> + smp_store_mb(l->locked, 0);
>>
>> /*
>> * At this point the memory pointed at by lock can be freed/reused,
>> * however we can still use the pv_node to kick the CPU.
>> + * The other vCPU may not really be halted, but kicking an active
>> + * vCPU is harmless other than the additional latency in completing
>> + * the unlock.
>> */
>> - if (READ_ONCE(node->state) == vcpu_halted)
>> + if (READ_ONCE(node->state) == vcpu_hashed)
>> pv_kick(node->cpu);
>> }
> I think most of that is not actually required; if we let pv_kick_node()
> set vcpu_hashed and avoid writing another value in pv_wait_head(), then
> __pv_queued_spin_unlock() has two cases:
>
> - pv_kick_node() set _SLOW_VAL, which is the same 'thread' and things
> observe program order and we're trivially guaranteed to see
> node->state and the hash state.

I just found out that letting pv_kick_node() wake up vCPUs at locking
time can give slightly better performance in some cases. So I am going
to keep it, but defer the kicking to unlock time when we can do multiple
kicks. The advantage of doing it at unlock time is that the kicking can
be done outside of the critical section. So I am going to keep the
current name.

Cheers,
Longman

2015-07-15 01:38:25

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead

On 07/13/2015 09:52 AM, Peter Zijlstra wrote:
> On Sat, Jul 11, 2015 at 04:36:53PM -0400, Waiman Long wrote:
>> Frequent CPU halting (vmexit) and CPU kicking (vmenter) lengthen the
>> critical section and block forward progress. This patch implements
>> a kick-ahead mechanism where the unlocker will kick the queue head
>> vCPUs as well as up to two additional vCPUs next to the queue head if
>> they were halted. The kickings are done after exiting the critical
>> section to improve parallelism.
>>
>> The amount of kick-ahead allowed depends on the number of vCPUs in
>> the VM guest. This change should improve overall system performance
>> in a busy overcommitted guest.
> -ENONUMBERS... also highly workload sensitive, if the lock hold time is
> just above our spin time you're wasting gobs of runtime.

Currently SPIN_THRESHOLD is (1<<15). My measurement of the pause
instruction is about 3ns on my 2.5GHz Haswell-EX system. That translates
to a spinning time of at least 100us. I don't think we have critical
sections in the kernel that take that long. I also found out that the
kick-to-wakeup time can be pretty long; kicking ahead will enable the
next CPU waiting in line to get the lock faster, as some of the wakeup
time will overlap with the critical sections of previous CPUs. I will
have the performance numbers in the v2 patch.

Cheers,
Longman
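
As a quick sanity check on the ~100us figure above (plain arithmetic
only, assuming the ~3ns-per-pause measurement quoted in the message):

#include <stdio.h>

int main(void)
{
	const long spin_threshold = 1L << 15;	/* SPIN_THRESHOLD */
	const double pause_ns = 3.0;		/* ~3ns per pause on Haswell-EX */

	/* 32768 iterations * ~3ns each ~= 98us, i.e. roughly 100us */
	printf("minimum spin time ~= %.1f us\n",
	       spin_threshold * pause_ns / 1000.0);
	return 0;
}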

2015-07-15 01:39:09

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 3/7] locking/pvqspinlock: Implement wait-early for overcommitted guest

On 07/13/2015 03:50 PM, Davidlohr Bueso wrote:
> On Sat, 2015-07-11 at 16:36 -0400, Waiman Long wrote:
>> /*
>> + * Queued Spinlock Spin Thresholds
>> + * -------------------------------
>> + * Because of the cacheline contention effect of the ticket spinlock, the
>> + * same spin threshold for queued spinlock will run a bit faster. So we set
>> + * a slight larger threshold for the queue head (1.25X) while the other queue
>> + * nodes will keep the same threshold.
> How did you come to 1.25x? These sort of magic numbers scare me.
>

Yes, this is kind of arbitrary. I am going to get rid of it in the v2 patch.

Cheers,
Longman

Subject: [tip:locking/core] locking/pvqspinlock: Only kick CPU at unlock time

Commit-ID: 75d2270280686bff21b9ba66c7f3dd379c887981
Gitweb: http://git.kernel.org/tip/75d2270280686bff21b9ba66c7f3dd379c887981
Author: Waiman Long <[email protected]>
AuthorDate: Sat, 11 Jul 2015 16:36:52 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 10:57:11 +0200

locking/pvqspinlock: Only kick CPU at unlock time

For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes queue head and once again
before it gets the lock. All these CPU kicking and halting (VMEXIT)
can be expensive and slow down system performance.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only to differentiate other
queue nodes from the queue head.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/qspinlock.c | 6 ++--
kernel/locking/qspinlock_paravirt.h | 66 +++++++++++++++++++++++++++----------
2 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..337c881 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock *lock)

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

@@ -440,7 +440,7 @@ queue:
cpu_relax();

arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_kick_node(lock, next);

release:
/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index ab8b1bb..c8e6e9a 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -22,9 +22,14 @@

#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

+/*
+ * Queue node uses: vcpu_running & vcpu_halted.
+ * Queue head uses: vcpu_running & vcpu_hashed.
+ */
enum vcpu_state {
vcpu_running = 0,
- vcpu_halted,
+ vcpu_halted, /* Used only in pv_wait_node */
+ vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
};

struct pv_node {
@@ -153,7 +158,8 @@ static void pv_init_node(struct mcs_spinlock *node)

/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
*/
static void pv_wait_node(struct mcs_spinlock *node)
{
@@ -172,9 +178,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
*
* [S] pn->state = vcpu_halted [S] next->locked = 1
* MB MB
- * [L] pn->locked [RmW] pn->state = vcpu_running
+ * [L] pn->locked [RmW] pn->state = vcpu_hashed
*
- * Matches the xchg() from pv_kick_node().
+ * Matches the cmpxchg() from pv_kick_node().
*/
smp_store_mb(pn->state, vcpu_halted);

@@ -182,9 +188,10 @@ static void pv_wait_node(struct mcs_spinlock *node)
pv_wait(&pn->state, vcpu_halted);

/*
- * Reset the vCPU state to avoid unncessary CPU kicking
+ * If pv_kick_node() changed us to vcpu_hashed, retain that value
+ * so that pv_wait_head() knows to not also try to hash this lock.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ cmpxchg(&pn->state, vcpu_halted, vcpu_running);

/*
* If the locked flag is still not set after wakeup, it is a
@@ -194,6 +201,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* MCS lock will be released soon.
*/
}
+
/*
* By now our node->locked should be 1 and our caller will not actually
* spin-wait for it. We do however rely on our caller to do a
@@ -202,24 +210,35 @@ static void pv_wait_node(struct mcs_spinlock *node)
}

/*
- * Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * Called after setting next->locked = 1 when we're the lock owner.
+ *
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state such
+ * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct __qspinlock *l = (void *)lock;

/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
+ * If the vCPU is indeed halted, advance its state to match that of
+ * pv_wait_node(). If OTOH this fails, the vCPU was running and will
+ * observe its next->locked value and advance itself.
*
- * This should be fine however, kicking people for no reason is
- * harmless.
+ * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+ */
+ if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ return;
+
+ /*
+ * Put the lock into the hash table and set the _Q_SLOW_VAL.
*
- * See the comment in pv_wait_node().
+ * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+ * the hash table later on at unlock time, no atomic instruction is
+ * needed.
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
- pv_kick(pn->cpu);
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
}

/*
@@ -233,6 +252,13 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
struct qspinlock **lp = NULL;
int loop;

+ /*
+ * If pv_kick_node() already advanced our state, we don't need to
+ * insert ourselves into the hash table anymore.
+ */
+ if (READ_ONCE(pn->state) == vcpu_hashed)
+ lp = (struct qspinlock **)1;
+
for (;;) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -240,9 +266,10 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
cpu_relax();
}

- WRITE_ONCE(pn->state, vcpu_halted);
if (!lp) { /* ONCE */
+ WRITE_ONCE(pn->state, vcpu_hashed);
lp = pv_hash(lock, pn);
+
/*
* We must hash before setting _Q_SLOW_VAL, such that
* when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
@@ -333,8 +360,11 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
/*
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
+ * The other vCPU may not really be halted, but kicking an active
+ * vCPU is harmless other than the additional latency in completing
+ * the unlock.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
+ if (READ_ONCE(node->state) == vcpu_hashed)
pv_kick(node->cpu);
}
/*