From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, "H. Peter Anvin"
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Davidlohr Bueso, Waiman Long
Subject: [PATCH v4 0/7] locking/qspinlock: Enhance pvqspinlock performance
Date: Fri, 31 Jul 2015 22:21:57 -0400
Message-Id: <1438395724-25910-1-git-send-email-Waiman.Long@hp.com>

v3->v4:
 - Patch 1: added a comment about a possible race condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Moved the PV unlock optimization patch forward to patch 4 and reran the
   performance tests.

v2->v3:
 - Moved the deferred kicking enablement patch forward and moved the
   kick-ahead patch back to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted to using state as a tri-state variable instead of adding an
   additional bi-state variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Added a new patch to optimize PV unlock code path performance.

v1->v2:
 - Took out the queued unfair lock patches.
 - Added a patch to simplify the PV unlock code.
 - Moved the pending bit and statistics collection patches to the front.
 - Kept vCPU kicking in pv_kick_node(), but deferred it to unlock time
   when appropriate.
 - Changed the wait-early patch to use adaptive spinning to better balance
   its different effects on normal and over-committed guests.
 - Added patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both normal and
over-committed VM guests. The kick-ahead and adaptive spinning patches
are inspired by the "Do Virtual Machines Really Scale?" blog from
Sanidhya Kashyap.

Patch 1 simplifies the unlock code by doing an unconditional vCPU kick
when _Q_SLOW_VAL is set, as the chance of a spurious wakeup showing up
in the statistical data that I collected was very low (1 or 2
occasionally).

Patch 2 adds pending bit support to pvqspinlock, improving performance
at light load.

Patch 3 allows the collection of various count data that are useful for
seeing what is happening in the system. They do add a bit of overhead
when enabled, slowing performance a tiny bit.

Patch 4 optimizes the PV unlock code path performance for the x86-64
architecture.

Patch 5 is an enablement patch for deferring vCPU kicks from the lock
side to the unlock side.

Patch 6 enables multiple vCPU kick-aheads at unlock time, outside of the
critical section, which can improve performance in over-committed guests
and sometimes even in normal guests. (A toy sketch of the kick-ahead
policy follows this overview.)

Patch 7 enables adaptive spinning in the queue nodes. This patch can
lead to a pretty big performance increase in over-committed guests at
the expense of a slight performance hit in normal guests.

Patches 2 & 4 improve the performance of the common uncontended and
lightly contended cases. Patches 5-7 are for improving performance in
over-committed VM guests.

Performance measurements were done on 32-CPU Westmere-EX and Haswell-EX
systems.
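As an aside before the numbers, here is a toy user-space sketch of the
kick-ahead policy in patch 6. This is not the kernel code: the waiter
array, the vcpu_halted state and the kick() stand-in for the pv_kick()
hypercall are purely illustrative, and the real patch walks the MCS
queue rather than an array. The only point it tries to make is that up
to a kick-ahead limit (PV_KICK_AHEAD_MAX in the patches) of halted
waiters are woken after the lock byte has already been released, i.e.
outside the critical section.

#include <stdio.h>

/* Simplified waiter states, loosely mirroring the PV qspinlock states. */
enum vcpu_state { vcpu_running, vcpu_halted };

struct waiter {
        int cpu;
        enum vcpu_state state;
};

/* Stand-in for the pv_kick() hypercall: just report which vCPU we kick. */
static void kick(int cpu)
{
        printf("kick vCPU %d\n", cpu);
}

/*
 * Kick-ahead policy sketch: after the lock has been released, kick the
 * queue head plus additional halted waiters behind it, up to
 * kick_ahead_max in total, so they are already runnable by the time the
 * lock reaches them.
 */
static int kick_ahead(const struct waiter *queue, int nwaiters,
                      int kick_ahead_max)
{
        int i, kicked = 0;

        for (i = 0; i < nwaiters && kicked < kick_ahead_max; i++) {
                if (queue[i].state == vcpu_halted) {
                        kick(queue[i].cpu);
                        kicked++;
                }
        }
        return kicked;
}

int main(void)
{
        struct waiter queue[] = {
                { 4, vcpu_halted }, { 9, vcpu_halted },
                { 2, vcpu_running }, { 7, vcpu_halted },
        };

        kick_ahead(queue, 4, 2);        /* kicks vCPUs 4 and 9 only */
        return 0;
}

With a kick-ahead limit of 2 in the example, only the first two halted
waiters are kicked; the rest are left for later unlocks. The build-time
numbers below were, of course, taken with the real patches.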
The Westmere-EX system got the most performance gain from patch 5,
whereas the Haswell-EX system got the most gain from patch 6 for
over-committed guests. The table below shows the Linux kernel build
times for various values of PV_KICK_AHEAD_MAX on an over-committed
48-vCPU guest on the Westmere-EX system:

  PV_KICK_AHEAD_MAX    Patches 1-5    Patches 1-6
  -----------------    -----------    -----------
          1             9m46.9s       11m10.1s
          2             9m40.2s       10m08.3s
          3             9m36.8s        9m49.8s
          4             9m35.9s        9m38.7s
          5             9m35.1s        9m33.0s
          6             9m35.7s        9m28.5s

With patches 1-5, the performance wasn't very sensitive to different
PV_KICK_AHEAD_MAX values. Adding patch 6 into the mix, however, changes
the picture quite dramatically. There is a performance regression if
PV_KICK_AHEAD_MAX is too small. Starting with a value of 4, increasing
PV_KICK_AHEAD_MAX only gets us a minor benefit.

Waiman Long (7):
  locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
  locking/pvqspinlock: Add pending bit support
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call
  locking/pvqspinlock: Allow vCPUs kick-ahead
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    7 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++
 kernel/locking/qspinlock.c                |   38 ++-
 kernel/locking/qspinlock_paravirt.h       |  546 +++++++++++++++++++++++++++--
 4 files changed, 612 insertions(+), 38 deletions(-)

v3-to-v4 diff:

 arch/x86/include/asm/qspinlock_paravirt.h |    5 +--
 kernel/locking/qspinlock_paravirt.h       |   45 +++++++++++++++++-----------
 2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 46f0f82..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -9,10 +9,10 @@
  */
 #ifdef CONFIG_64BIT
 
-PV_CALLEE_SAVE_REGS_THUNK(pv_queued_spin_unlock_slowpath);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
 #define __pv_queued_spin_unlock __pv_queued_spin_unlock
 #define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
-#define PV_UNLOCK_SLOWPATH "__raw_callee_save_pv_queued_spin_unlock_slowpath"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
 
 /*
  * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
@@ -46,7 +46,6 @@ asm(".pushsection .text;"
     "jne .slowpath;"
     "pop %rdx;"
     "ret;"
-    "nop;"
     ".slowpath: "
     "push %rsi;"
     "movzbl %al,%esi;"
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 56c3717..d04911b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -463,16 +463,16 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
         /*
          * wait for in-progress pending->locked hand-overs
          */
-        if (val == _Q_PENDING_VAL) {
-                while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
-                        loop--)
-                        cpu_relax();
+        while ((val == _Q_PENDING_VAL) && loop) {
+                cpu_relax();
+                val = atomic_read(&lock->val);
+                loop--;
         }
 
         /*
          * trylock || pending
          */
-        for (;;) {
+        for (;; loop--) {
                 if (val & ~_Q_LOCKED_MASK)
                         goto queue;
                 new = _Q_LOCKED_VAL;
@@ -481,7 +481,7 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
                 old = atomic_cmpxchg(&lock->val, val, new);
                 if (old == val)
                         break;
-                if (loop-- <= 0)
+                if (!loop)
                         goto queue;
         }
 
@@ -490,16 +490,18 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
         /*
          * We are pending, wait for the owner to go away.
          */
-        while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
-                && (loop-- > 0))
-                cpu_relax();
-
-        if (!(val & _Q_LOCKED_MASK)) {
-                clear_pending_set_locked(lock);
-                goto gotlock;
+        for (; loop; loop--, cpu_relax()) {
+                val = smp_load_acquire(&lock->val.counter);
+                if (!(val & _Q_LOCKED_MASK)) {
+                        clear_pending_set_locked(lock);
+                        pvstat_inc(pvstat_pend_lock);
+                        goto gotlock;
+                }
         }
+
         /*
-         * Clear the pending bit and fall back to queuing
+         * Fail to acquire the lock within the spinning period,
+         * so clear the pending bit and fall back to queuing.
          */
         clear_pending(lock);
         pvstat_inc(pvstat_pend_fail);
@@ -719,11 +721,11 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock fastpath and slowpath functions to be used
- * in stead of queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
 __visible void
-pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
 {
         struct __qspinlock *l = (void *)lock;
         struct pv_node *node, *next;
@@ -771,6 +773,13 @@ pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
         /*
          * At this point the memory pointed at by lock can be freed/reused,
          * however we can still use the pv_node to kick the CPU.
+         *
+         * As smp_store_release() is not a full barrier, adding a check to
+         * the node->state doesn't guarantee the checking is really done
+         * after clearing the lock byte since they are in 2 separate
+         * cachelines and so hardware can reorder them. So either we insert
+         * memory barrier here and in the corresponding pv_wait_head()
+         * function or we do an unconditional kick which is what is done here.
          */
         pvstat_inc(pvstat_unlock_kick);
         pv_kick(node->cpu);
@@ -803,6 +812,6 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
         if (likely(lockval == _Q_LOCKED_VAL))
                 return;
 
-        pv_queued_spin_unlock_slowpath(lock, lockval);
+        __pv_queued_spin_unlock_slowpath(lock, lockval);
 }
 #endif /* __pv_queued_spin_unlock */
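A note on the new comment in __pv_queued_spin_unlock_slowpath() above:
the situation it describes is the classic store-buffering pattern. The
unlocker stores (clears the lock byte) and then loads node->state, while
the waiter stores its state and then re-reads the lock byte; since
smp_store_release() is not a full barrier, each CPU's load can be
satisfied before its own store is globally visible, so both sides can
see the stale value -- the unlocker skips the kick and the waiter halts
anyway. The stand-alone user-space sketch below (C11 atomics, not kernel
code; lockb and nstate are just stand-ins for the lock byte and
node->state) models that pattern. With relaxed ordering the
r1 == 0 && r2 == 0 outcome typically shows up on x86; making the four
accesses seq_cst, or putting atomic_thread_fence(memory_order_seq_cst)
between each store and its following load, rules it out. Avoiding that
full barrier on both the unlock and the pv_wait_head() side is why the
patch kicks unconditionally instead.

/*
 * Store-buffering litmus sketch (user space, NOT kernel code).
 *
 * "lockb" stands in for the lock byte and "nstate" for node->state.
 * Each thread stores to one location and then loads the other with
 * relaxed ordering.  A trial ending with r1 == 0 && r2 == 0 is the
 * hazard case: in the lock analogy, the unlocker decides not to kick
 * while the waiter decides to halt.
 *
 * Build with: cc -O2 -pthread sb_litmus.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TRIALS 200000

static atomic_int lockb, nstate;                /* the two "cachelines" */
static atomic_int start1, start2, done1, done2; /* per-trial handshakes */
static int r1, r2;

static void *unlocker(void *arg)
{
        (void)arg;
        for (int i = 0; i < TRIALS; i++) {
                while (atomic_load(&start1) != i + 1)
                        ;                       /* wait for trial i */
                atomic_store_explicit(&lockb, 1, memory_order_relaxed);
                r1 = atomic_load_explicit(&nstate, memory_order_relaxed);
                atomic_store(&done1, i + 1);
        }
        return NULL;
}

static void *waiter(void *arg)
{
        (void)arg;
        for (int i = 0; i < TRIALS; i++) {
                while (atomic_load(&start2) != i + 1)
                        ;
                atomic_store_explicit(&nstate, 1, memory_order_relaxed);
                r2 = atomic_load_explicit(&lockb, memory_order_relaxed);
                atomic_store(&done2, i + 1);
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        long both_zero = 0;

        pthread_create(&t1, NULL, unlocker, NULL);
        pthread_create(&t2, NULL, waiter, NULL);

        for (int i = 0; i < TRIALS; i++) {
                atomic_store(&lockb, 0);        /* reset shared state */
                atomic_store(&nstate, 0);
                atomic_store(&start1, i + 1);   /* release both threads */
                atomic_store(&start2, i + 1);
                while (atomic_load(&done1) != i + 1 ||
                       atomic_load(&done2) != i + 1)
                        ;                       /* wait for both results */
                if (r1 == 0 && r2 == 0)
                        both_zero++;            /* both saw stale values */
        }

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("store->load reordering observed in %ld of %d trials\n",
               both_zero, TRIALS);
        return 0;
}

Whether the reordering actually materializes in a given run depends on
the hardware and the scheduler; the sketch only illustrates why the
node->state check cannot be relied upon without the extra barriers, and
hence why the unconditional kick is the cheaper choice here.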