Date: Mon, 2 Nov 2015 17:36:26 +0100
From: Peter Zijlstra
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, "H. Peter Anvin", x86@kernel.org,
	linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso
Subject: Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline

On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
> A queue head CPU, after acquiring the lock, will have to notify
> the next CPU in the wait queue that it has become the new queue
> head. This involves loading a new cacheline from the MCS node of the
> next CPU. That operation can be expensive and adds to the latency of
> the locking operation.
>
> This patch adds code to optimistically prefetch the next MCS node
> cacheline if the next pointer is defined and it has been spinning
> on the MCS lock for a while. This reduces the locking latency and
> improves the system throughput.
>
> Using a locking microbenchmark on a Haswell-EX system, this patch
> can improve throughput by about 5%.

How does it affect IVB-EX (which you were testing earlier IIRC)?

> Signed-off-by: Waiman Long
> ---
>  kernel/locking/qspinlock.c | 21 +++++++++++++++++++++
>  1 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 7868418..c1c8a1a 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -396,6 +396,7 @@ queue:
>  	 * p,*,* -> n,*,*
>  	 */
>  	old = xchg_tail(lock, tail);
> +	next = NULL;
>
>  	/*
>  	 * if there was a previous node; link it and wait until reaching the
> @@ -407,6 +408,16 @@ queue:
>
>  		pv_wait_node(node);
>  		arch_mcs_spin_lock_contended(&node->locked);
> +
> +		/*
> +		 * While waiting for the MCS lock, the next pointer may have
> +		 * been set by another lock waiter. We optimistically load
> +		 * the next pointer & prefetch the cacheline for writing
> +		 * to reduce latency in the upcoming MCS unlock operation.
> +		 */
> +		next = READ_ONCE(node->next);
> +		if (next)
> +			prefetchw(next);
>  	}

OK so far I suppose. Since we already read node->locked, which is in
the same cacheline, also reading node->next isn't extra pressure. And
we can then prefetch that cacheline.
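To make the cacheline argument concrete: both fields sit at the front
of the MCS node, well within a single cacheline. A minimal sketch of
the layout and of a write-prefetch helper -- the struct follows
kernel/locking/mcs_spinlock.h, while the prefetchw() here is just the
GCC builtin standing in for the kernel's <linux/prefetch.h> version:

	struct mcs_spinlock {
		struct mcs_spinlock *next;	/* successor in the wait queue */
		int locked;			/* nonzero once we own the MCS lock */
		int count;			/* nesting count, see qspinlock.c */
	};

	static inline void prefetchw(void *p)
	{
		/*
		 * Prefetch for write: pull the successor's node in
		 * exclusive state, so the eventual store to
		 * next->locked in arch_mcs_spin_unlock_contended()
		 * doesn't stall on a read-for-ownership.
		 */
		__builtin_prefetch(p, 1, 3);
	}

So once the spin on ->locked has pulled the line in, the extra load of
->next hits that same line for free.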
>  	/*
> @@ -426,6 +437,15 @@ queue:
>  		cpu_relax();
>
>  	/*
> +	 * If the next pointer is defined, we are not tail anymore.
> +	 * In this case, claim the spinlock & release the MCS lock.
> +	 */
> +	if (next) {
> +		set_locked(lock);
> +		goto mcs_unlock;
> +	}
> +
> +	/*
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> @@ -458,6 +478,7 @@ queue:
>  	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
>
> +mcs_unlock:
>  	arch_mcs_spin_unlock_contended(&next->locked);
>  	pv_kick_node(lock, next);
>

This, however, appears to be an independent optimization. Is it worth
it? Would we not already have observed val != tail in this case?

At which point we're just adding extra code for no gain.

That is, if we observe @next, must we then not also observe val != tail?
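For reference, the ordering that last question hinges on, condensed
into a standalone sketch. This uses C11 atomics and collapses the
lock-word encoding to a bare tail value; xchg_tail() and WRITE_ONCE()
named in the comment are the existing kernel calls, everything else is
illustrative, not the kernel source:

	#include <stdatomic.h>
	#include <stdint.h>

	#define Q_LOCKED_VAL	1u	/* simplified _Q_LOCKED_VAL */

	/*
	 * Successor side, in program order (existing code):
	 *
	 * (A)	old = xchg_tail(lock, tail);	publish ourselves as tail
	 * (B)	WRITE_ONCE(prev->next, node);	link behind the old tail
	 *
	 * By the time the queue head observes (B) -- a non-NULL
	 * node->next -- (A) has already replaced its tail in the lock
	 * word, so the uncontended cmpxchg below must fail anyway:
	 */
	static void claim_lock(_Atomic uint32_t *lockval, uint32_t tail)
	{
		uint32_t val = atomic_load(lockval);

		for (;;) {
			if (val != tail) {
				/* contended: only set the locked byte */
				atomic_fetch_or(lockval, Q_LOCKED_VAL);
				return;
			}
			/* n,0,0 -> 0,0,1 : uncontended; clear tail, set lock */
			if (atomic_compare_exchange_strong(lockval, &val,
							   Q_LOCKED_VAL))
				return;
			/* on failure, val was reloaded; retry */
		}
	}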