From: Waiman Long <Waiman.Long@hp.com>
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, "H. Peter Anvin"
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Davidlohr Bueso, Waiman Long
Subject: [PATCH v2 6/6] locking/pvqspinlock: Queue node adaptive spinning
Date: Tue, 14 Jul 2015 22:13:37 -0400
Message-Id: <1436926417-20256-7-git-send-email-Waiman.Long@hp.com>
X-Mailer: git-send-email 1.7.1
In-Reply-To: <1436926417-20256-1-git-send-email-Waiman.Long@hp.com>
References: <1436926417-20256-1-git-send-email-Waiman.Long@hp.com>

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue have already been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to get
the lock anytime soon, as the earlier ones have to be woken up and take
their turn first. Reducing the spinning threshold is found to improve
performance in an overcommitted VM guest, but to decrease performance
when there is no overcommitment.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() earlier under the following 3 conditions:

 1) the vCPU in the previous queue node has been halted;
 2) the current vCPU is at least 2 nodes away from the lock holder;
 3) the current vCPU has seen a lot of pv_wait() calls recently.

If all the conditions are true, it will spin less before calling
pv_wait().

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems were configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                  Haswell
  Patch         32 vCPUs   48 vCPUs     32 vCPUs   48 vCPUs
  -----         --------   --------     --------   --------
  Before patch   3m01.3s    9m50.9s      2m00.5s   13m28.1s
  After patch    3m06.2s    9m38.0s      2m01.8s    9m06.9s

This patch caused a slight performance degradation for 32 vCPUs. For
48 vCPUs, there was some performance improvement for Westmere and a
big performance jump for Haswell.
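To make the three-way test above concrete, here is a minimal userspace
sketch of the wait-early decision (illustration only: the
pv_node_sketch layout and the should_wait_early() helper are made up
for this example; only PV_WAITHIST_THRESHOLD and the comparisons mirror
the patch):

#include <stdbool.h>
#include <stdio.h>

#define PV_WAITHIST_THRESHOLD	15

enum vcpu_state { vcpu_running = 0, vcpu_halted };

struct pv_node_sketch {
	unsigned char state;	/* vcpu_running or vcpu_halted */
	unsigned char waithist;	/* recent pv_wait() history */
	bool locked;		/* stand-in for the MCS node's locked flag */
};

/*
 * All three conditions from the changelog must hold:
 * 1) the vCPU in the previous node has been halted,
 * 2) we are at least 2 nodes from the lock holder, i.e. the previous
 *    node is not the queue head (its locked flag is still clear),
 * 3) this vCPU has seen many pv_wait() calls recently.
 */
static bool should_wait_early(const struct pv_node_sketch *self,
			      const struct pv_node_sketch *prev)
{
	return self->waithist > PV_WAITHIST_THRESHOLD &&
	       !prev->locked &&
	       prev->state == vcpu_halted;
}

int main(void)
{
	struct pv_node_sketch prev = { .state = vcpu_halted, .locked = false };
	struct pv_node_sketch self = { .waithist = 20 };

	printf("wait early? %s\n",
	       should_wait_early(&self, &prev) ? "yes" : "no");
	return 0;
}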
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 kernel/locking/qspinlock.c          |    5 +-
 kernel/locking/qspinlock_paravirt.h |   93 ++++++++++++++++++++++++++++++++--
 2 files changed, 90 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 94fdd27..da39d43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -258,7 +258,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+					   struct mcs_spinlock *prev) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_head(struct qspinlock *lock,
@@ -415,7 +416,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node);
+		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 	}
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 1f0485c..5f97058 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -22,12 +22,37 @@
 #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
 
 /*
- * Queued Spinlock Spin Threshold
+ * Queued Spinlock Spin Thresholds
  *
  * The vCPU will spin a relatively short time in pending mode before falling
  * back to queuing.
+ *
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will spin less if the following conditions are true:
+ * 1) vCPU in the previous node is halted
+ * 2) it is at least 2 nodes away from the lock holder
+ * 3) there is a lot of pv_wait() in the current vCPU recently
+ *
+ * The last condition is monitored by the waithist field in the pv_node
+ * structure, which tracks the history of pv_wait() relative to slowpath
+ * calls. Each pv_wait() will increment this field by PV_WAIT_INC until it
+ * exceeds PV_WAITHIST_MAX. Each slowpath lock call will decrement it by 1
+ * until it reaches PV_WAITHIST_MIN. If its value is higher than
+ * PV_WAITHIST_THRESHOLD, the vCPU will spin less. The point of this
+ * adaptive spinning is to enable wait-early when the system is
+ * overcommitted, but not to use it when it is not.
+ *
+ * The queue node vCPU will monitor the state of the previous node
+ * periodically to see if there is any change.
 */
-#define PENDING_SPIN_THRESHOLD	(SPIN_THRESHOLD >> 5)
+#define PENDING_SPIN_THRESHOLD		(SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_THRESHOLD		SPIN_THRESHOLD
+#define QNODE_SPIN_THRESHOLD_SHORT	(QNODE_SPIN_THRESHOLD >> 5)
+#define QNODE_SPIN_CHECK_MASK		0xff
+#define PV_WAIT_INC			2
+#define PV_WAITHIST_MIN			1
+#define PV_WAITHIST_MAX			30
+#define PV_WAITHIST_THRESHOLD		15
 
 enum vcpu_state {
 	vcpu_running = 0,
@@ -41,6 +66,7 @@ struct pv_node {
 	int			cpu;
 	u8			state;
 	u8			hashed;		/* Set if in hashed table */
+	u8			waithist;
 };
 
 /*
@@ -49,6 +75,7 @@ struct pv_node {
 enum pv_qlock_stat {
 	pvstat_wait_head,
 	pvstat_wait_node,
+	pvstat_wait_early,
 	pvstat_kick_time,
 	pvstat_lock_kick,
 	pvstat_unlock_kick,
@@ -71,6 +98,7 @@ enum pv_qlock_stat {
 static const char * const stat_fsnames[pvstat_num] = {
 	[pvstat_wait_head]   = "wait_head_count",
 	[pvstat_wait_node]   = "wait_node_count",
+	[pvstat_wait_early]  = "wait_early_count",
 	[pvstat_kick_time]   = "kick_time_count",
 	[pvstat_lock_kick]   = "lock_kick_count",
 	[pvstat_unlock_kick] = "unlock_kick_count",
@@ -402,19 +430,67 @@ gotlock:
 
 /*
  * Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_kick_node() is used to wake the vcpu again, but the kicking may also
+ * be deferred to the unlock time.
  */
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	struct pv_node *pp = (struct pv_node *)prev;
+	bool wait_early, can_wait_early;
 	int loop;
 
 	for (;;) {
-		for (loop = SPIN_THRESHOLD; loop; loop--) {
+		/*
+		 * Spin less if the previous vCPU was in the halted state
+		 * and it is not the queue head.
+		 */
+		can_wait_early = (pn->waithist > PV_WAITHIST_THRESHOLD);
+		wait_early = can_wait_early && !READ_ONCE(prev->locked) &&
+			     (READ_ONCE(pp->state) == vcpu_halted);
+		loop = wait_early ? QNODE_SPIN_THRESHOLD_SHORT
+				  : QNODE_SPIN_THRESHOLD;
+		for (; loop; loop--, cpu_relax()) {
+			bool halted;
+
 			if (READ_ONCE(node->locked))
 				return;
-			cpu_relax();
+
+			if (!can_wait_early || (loop & QNODE_SPIN_CHECK_MASK))
+				continue;
+
+			/*
+			 * Look for state transition at previous node.
+			 *
+			 * running => halted:
+			 *	call pv_wait() now if kick-ahead is enabled
+			 *	or reduce spin threshold to
+			 *	QNODE_SPIN_THRESHOLD_SHORT or less.
+			 * halted => running:
+			 *	reset spin threshold to QNODE_SPIN_THRESHOLD
+			 */
+			halted = (READ_ONCE(pp->state) == vcpu_halted) &&
+				 !READ_ONCE(prev->locked);
+			if (wait_early == halted)
+				continue;
+			wait_early = halted;
+
+			if (!wait_early)
+				loop = QNODE_SPIN_THRESHOLD;
+			else if (pv_kick_ahead)
+				break;
+			else if (loop > QNODE_SPIN_THRESHOLD_SHORT)
+				loop = QNODE_SPIN_THRESHOLD_SHORT;
 		}
+		if (wait_early)
+			pvstat_inc(pvstat_wait_early);
+
+		/*
+		 * A pv_wait() while !wait_early has higher weight than when
+		 * wait_early is true.
+		 */
+		if (pn->waithist < PV_WAITHIST_MAX)
+			pn->waithist += wait_early ?
+					1 : PV_WAIT_INC;
 
 		/*
 		 * Order pn->state vs pn->locked thusly:
@@ -538,6 +614,9 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	struct qspinlock **lp = NULL;
 	int loop;
 
+	if (pn->waithist > PV_WAITHIST_MIN)
+		pn->waithist--;	/* Pre-decrement the waithist field */
+
 	for (;;) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (!READ_ONCE(l->locked))
@@ -573,6 +652,8 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			return;
 		}
 	}
+	if (pn->waithist < PV_WAITHIST_MAX)
+		pn->waithist += PV_WAIT_INC;
 
 	pvstat_inc(pvstat_wait_head);
 	pv_wait(&l->locked, _Q_SLOW_VAL);
-- 
1.7.1
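For completeness, the waithist bookkeeping above reduces to a saturating
counter: pv_wait() events push it up (by PV_WAIT_INC, or by 1 when the
wait was already an early one), each slowpath pass pulls it down by 1,
and wait-early is enabled only while it sits above
PV_WAITHIST_THRESHOLD. A minimal standalone C sketch of that decay/boost
arithmetic, assuming hypothetical event hooks on_slowpath_entry() and
on_pv_wait() in place of the real pv_wait_head()/pv_wait_node() call
sites:

#include <stdbool.h>
#include <stdio.h>

#define PV_WAIT_INC		2
#define PV_WAITHIST_MIN		1
#define PV_WAITHIST_MAX		30
#define PV_WAITHIST_THRESHOLD	15

static unsigned char waithist;

/* Each slowpath lock call decays the history by 1, floored at MIN. */
static void on_slowpath_entry(void)
{
	if (waithist > PV_WAITHIST_MIN)
		waithist--;
}

/* Each pv_wait() boosts the history, capped at MAX. */
static void on_pv_wait(bool wait_early)
{
	if (waithist < PV_WAITHIST_MAX)
		waithist += wait_early ? 1 : PV_WAIT_INC;
}

static bool wait_early_enabled(void)
{
	return waithist > PV_WAITHIST_THRESHOLD;
}

int main(void)
{
	int i;

	/* A burst of halted waits pushes the counter past the threshold... */
	for (i = 0; i < 16; i++) {
		on_slowpath_entry();
		on_pv_wait(false);
	}
	printf("after burst: %d, wait-early %s\n", waithist,
	       wait_early_enabled() ? "on" : "off");

	/* ...and wait-free slowpath passes decay it back below it. */
	for (i = 0; i < 5; i++)
		on_slowpath_entry();
	printf("after decay: %d, wait-early %s\n", waithist,
	       wait_early_enabled() ? "on" : "off");
	return 0;
}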