From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, "H. Peter Anvin"
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Davidlohr Bueso, Waiman Long
Subject: [PATCH v4 0/7] locking/qspinlock: Enhance pvqspinlock performance
Date: Fri, 31 Jul 2015 22:21:57 -0400
Message-Id: <1438395724-25910-1-git-send-email-Waiman.Long@hp.com>

v3->v4:
 - Patch 1: added a comment about a possible race condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Moved the PV unlock optimization patch forward to patch 4 and reran the
   performance tests.

v2->v3:
 - Moved the deferred kicking enablement patch forward and moved the
   kick-ahead patch back to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted to using state as a tri-state variable instead of adding an
   additional bi-state variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Added a new patch to optimize PV unlock code path performance.

v1->v2:
 - Took out the queued unfair lock patches.
 - Added a patch to simplify the PV unlock code.
 - Moved the pending bit and statistics collection patches to the front.
 - Kept vCPU kicking in pv_kick_node(), but deferred it to unlock time
   when appropriate.
 - Changed the wait-early patch to use adaptive spinning to better balance
   its different effects on normal and over-committed guests.
 - Added patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both normal and
over-committed VM guests. The kick-ahead and adaptive spinning patches
are inspired by the "Do Virtual Machines Really Scale?" blog from
Sanidhya Kashyap.

Patch 1 simplifies the unlock code by doing an unconditional vCPU kick
when _Q_SLOW_VAL is set, as the chance of a spurious wakeup showing up
in the statistical data that I collected was very low (1 or 2
occasionally).

Patch 2 adds pending bit support to pvqspinlock, improving performance
at light load.

Patch 3 allows the collection of various count data that are useful for
seeing what is happening in the system. They do add a bit of overhead
when enabled, slowing performance a tiny bit.

Patch 4 optimizes the PV unlock code path performance for the x86-64
architecture.

Patch 5 is an enablement patch for deferring vCPU kicks from the lock
side to the unlock side.

Patch 6 enables multiple vCPU kick-aheads at unlock time, outside of the
critical section, which can improve performance in over-committed guests
and sometimes even in normal guests. (A toy sketch of the kick-ahead
policy follows this overview.)

Patch 7 enables adaptive spinning in the queue nodes. This patch can
lead to a pretty big performance increase in over-committed guests at
the expense of a slight performance hit in normal guests.

Patches 2 & 4 improve the performance of the common uncontended and
lightly contended cases. Patches 5-7 are for improving performance in
over-committed VM guests.

Performance measurements were done on 32-CPU Westmere-EX and Haswell-EX
systems.
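As an aside before the numbers, here is a toy user-space sketch of the
kick-ahead policy in patch 6. This is not the kernel code: the waiter
array, the vcpu_halted state and the kick() stand-in for the pv_kick()
hypercall are purely illustrative, and the real patch walks the MCS
queue rather than an array. The only point it tries to make is that up
to a kick-ahead limit (PV_KICK_AHEAD_MAX in the patches) of halted
waiters are woken after the lock byte has already been released, i.e.
outside the critical section.

#include <stdio.h>

/* Simplified waiter states, loosely mirroring the PV qspinlock states. */
enum vcpu_state { vcpu_running, vcpu_halted };

struct waiter {
        int cpu;
        enum vcpu_state state;
};

/* Stand-in for the pv_kick() hypercall: just report which vCPU we kick. */
static void kick(int cpu)
{
        printf("kick vCPU %d\n", cpu);
}

/*
 * Kick-ahead policy sketch: after the lock has been released, kick the
 * queue head plus additional halted waiters behind it, up to
 * kick_ahead_max in total, so they are already runnable by the time the
 * lock reaches them.
 */
static int kick_ahead(const struct waiter *queue, int nwaiters,
                      int kick_ahead_max)
{
        int i, kicked = 0;

        for (i = 0; i < nwaiters && kicked < kick_ahead_max; i++) {
                if (queue[i].state == vcpu_halted) {
                        kick(queue[i].cpu);
                        kicked++;
                }
        }
        return kicked;
}

int main(void)
{
        struct waiter queue[] = {
                { 4, vcpu_halted }, { 9, vcpu_halted },
                { 2, vcpu_running }, { 7, vcpu_halted },
        };

        kick_ahead(queue, 4, 2);        /* kicks vCPUs 4 and 9 only */
        return 0;
}

With a kick-ahead limit of 2 in the example, only the first two halted
waiters are kicked; the rest are left for later unlocks. The build-time
numbers below were, of course, taken with the real patches.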
The Westmere-EX system got the most performance gain from patch 5,
whereas the Haswell-EX system got the most gain from patch 6 for
over-committed guests. The table below shows the Linux kernel build
times for various values of PV_KICK_AHEAD_MAX on an over-committed
48-vCPU guest on the Westmere-EX system:

  PV_KICK_AHEAD_MAX    Patches 1-5    Patches 1-6
  -----------------    -----------    -----------
          1             9m46.9s       11m10.1s
          2             9m40.2s       10m08.3s
          3             9m36.8s        9m49.8s
          4             9m35.9s        9m38.7s
          5             9m35.1s        9m33.0s
          6             9m35.7s        9m28.5s

With patches 1-5, the performance wasn't very sensitive to different
PV_KICK_AHEAD_MAX values. Adding patch 6 into the mix, however, changes
the picture quite dramatically. There is a performance regression if
PV_KICK_AHEAD_MAX is too small. Starting with a value of 4, increasing
PV_KICK_AHEAD_MAX only gets us a minor benefit.

Waiman Long (7):
  locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
  locking/pvqspinlock: Add pending bit support
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call
  locking/pvqspinlock: Allow vCPUs kick-ahead
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    7 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++
 kernel/locking/qspinlock.c                |   38 ++-
 kernel/locking/qspinlock_paravirt.h       |  546 +++++++++++++++++++++++++++--
 4 files changed, 612 insertions(+), 38 deletions(-)

v3-to-v4 diff:

 arch/x86/include/asm/qspinlock_paravirt.h |    5 +--
 kernel/locking/qspinlock_paravirt.h       |   45 +++++++++++++++++-----------
 2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 46f0f82..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -9,10 +9,10 @@
  */
 #ifdef CONFIG_64BIT
 
-PV_CALLEE_SAVE_REGS_THUNK(pv_queued_spin_unlock_slowpath);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
 #define __pv_queued_spin_unlock __pv_queued_spin_unlock
 #define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
-#define PV_UNLOCK_SLOWPATH "__raw_callee_save_pv_queued_spin_unlock_slowpath"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
 
 /*
  * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
@@ -46,7 +46,6 @@ asm(".pushsection .text;"
     "jne .slowpath;"
     "pop %rdx;"
     "ret;"
-    "nop;"
     ".slowpath: "
     "push %rsi;"
     "movzbl %al,%esi;"
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 56c3717..d04911b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -463,16 +463,16 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
         /*
          * wait for in-progress pending->locked hand-overs
          */
-        if (val == _Q_PENDING_VAL) {
-                while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
-                        loop--)
-                        cpu_relax();
+        while ((val == _Q_PENDING_VAL) && loop) {
+                cpu_relax();
+                val = atomic_read(&lock->val);
+                loop--;
         }
 
         /*
          * trylock || pending
          */
-        for (;;) {
+        for (;; loop--) {
                 if (val & ~_Q_LOCKED_MASK)
                         goto queue;
                 new = _Q_LOCKED_VAL;
@@ -481,7 +481,7 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
                 old = atomic_cmpxchg(&lock->val, val, new);
                 if (old == val)
                         break;
-                if (loop-- <= 0)
+                if (!loop)
                         goto queue;
         }
 
@@ -490,16 +490,18 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
         /*
          * We are pending, wait for the owner to go away.
          */
-        while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
-                && (loop-- > 0))
-                cpu_relax();
-
-        if (!(val & _Q_LOCKED_MASK)) {
-                clear_pending_set_locked(lock);
-                goto gotlock;
+        for (; loop; loop--, cpu_relax()) {
+                val = smp_load_acquire(&lock->val.counter);
+                if (!(val & _Q_LOCKED_MASK)) {
+                        clear_pending_set_locked(lock);
+                        pvstat_inc(pvstat_pend_lock);
+                        goto gotlock;
+                }
         }
+
         /*
-         * Clear the pending bit and fall back to queuing
+         * Fail to acquire the lock within the spinning period,
+         * so clear the pending bit and fall back to queuing.
          */
         clear_pending(lock);
         pvstat_inc(pvstat_pend_fail);
@@ -719,11 +721,11 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock fastpath and slowpath functions to be used
- * in stead of queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
 __visible void
-pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
 {
         struct __qspinlock *l = (void *)lock;
         struct pv_node *node, *next;
@@ -771,6 +773,13 @@ pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
         /*
          * At this point the memory pointed at by lock can be freed/reused,
          * however we can still use the pv_node to kick the CPU.
+         *
+         * As smp_store_release() is not a full barrier, adding a check to
+         * the node->state doesn't guarantee the checking is really done
+         * after clearing the lock byte since they are in 2 separate
+         * cachelines and so hardware can reorder them. So either we insert
+         * memory barrier here and in the corresponding pv_wait_head()
+         * function or we do an unconditional kick which is what is done here.
          */
         pvstat_inc(pvstat_unlock_kick);
         pv_kick(node->cpu);
@@ -803,6 +812,6 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
         if (likely(lockval == _Q_LOCKED_VAL))
                 return;
 
-        pv_queued_spin_unlock_slowpath(lock, lockval);
+        __pv_queued_spin_unlock_slowpath(lock, lockval);
 }
 #endif /* __pv_queued_spin_unlock */
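A note on the new comment in __pv_queued_spin_unlock_slowpath() above:
the situation it describes is the classic store-buffering pattern. The
unlocker stores (clears the lock byte) and then loads node->state, while
the waiter stores its state and then re-reads the lock byte; since
smp_store_release() is not a full barrier, each CPU's load can be
satisfied before its own store is globally visible, so both sides can
see the stale value -- the unlocker skips the kick and the waiter halts
anyway. The stand-alone user-space sketch below (C11 atomics, not kernel
code; lockb and nstate are just stand-ins for the lock byte and
node->state) models that pattern. With relaxed ordering the
r1 == 0 && r2 == 0 outcome typically shows up on x86; making the four
accesses seq_cst, or putting atomic_thread_fence(memory_order_seq_cst)
between each store and its following load, rules it out. Avoiding that
full barrier on both the unlock and the pv_wait_head() side is why the
patch kicks unconditionally instead.

/*
 * Store-buffering litmus sketch (user space, NOT kernel code).
 *
 * "lockb" stands in for the lock byte and "nstate" for node->state.
 * Each thread stores to one location and then loads the other with
 * relaxed ordering.  A trial ending with r1 == 0 && r2 == 0 is the
 * hazard case: in the lock analogy, the unlocker decides not to kick
 * while the waiter decides to halt.
 *
 * Build with: cc -O2 -pthread sb_litmus.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TRIALS 200000

static atomic_int lockb, nstate;                /* the two "cachelines" */
static atomic_int start1, start2, done1, done2; /* per-trial handshakes */
static int r1, r2;

static void *unlocker(void *arg)
{
        (void)arg;
        for (int i = 0; i < TRIALS; i++) {
                while (atomic_load(&start1) != i + 1)
                        ;                       /* wait for trial i */
                atomic_store_explicit(&lockb, 1, memory_order_relaxed);
                r1 = atomic_load_explicit(&nstate, memory_order_relaxed);
                atomic_store(&done1, i + 1);
        }
        return NULL;
}

static void *waiter(void *arg)
{
        (void)arg;
        for (int i = 0; i < TRIALS; i++) {
                while (atomic_load(&start2) != i + 1)
                        ;
                atomic_store_explicit(&nstate, 1, memory_order_relaxed);
                r2 = atomic_load_explicit(&lockb, memory_order_relaxed);
                atomic_store(&done2, i + 1);
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        long both_zero = 0;

        pthread_create(&t1, NULL, unlocker, NULL);
        pthread_create(&t2, NULL, waiter, NULL);

        for (int i = 0; i < TRIALS; i++) {
                atomic_store(&lockb, 0);        /* reset shared state */
                atomic_store(&nstate, 0);
                atomic_store(&start1, i + 1);   /* release both threads */
                atomic_store(&start2, i + 1);
                while (atomic_load(&done1) != i + 1 ||
                       atomic_load(&done2) != i + 1)
                        ;                       /* wait for both results */
                if (r1 == 0 && r2 == 0)
                        both_zero++;            /* both saw stale values */
        }

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("store->load reordering observed in %ld of %d trials\n",
               both_zero, TRIALS);
        return 0;
}

Whether the reordering actually materializes in a given run depends on
the hardware and the scheduler; the sketch only illustrates why the
node->state check cannot be relied upon without the extra barriers, and
hence why the unconditional kick is the cheaper choice here.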