Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756509AbcC2Juw (ORCPT ); Tue, 29 Mar 2016 05:50:52 -0400 Received: from bombadil.infradead.org ([198.137.202.9]:39793 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752047AbcC2Jut (ORCPT ); Tue, 29 Mar 2016 05:50:49 -0400 Date: Tue, 29 Mar 2016 11:50:39 +0200 From: Peter Zijlstra To: Mike Galbraith Cc: Petros Koutoupis , linux-kernel@vger.kernel.org, "Paul E. McKenney" , Thomas Gleixner , Davidlohr Bueso , Linus Torvalds , catalin.marinas@arm.com, "H. Peter Anvin" Subject: Re: futex: clarification needed with drop_futex_key_refs and memory barriers Message-ID: <20160329095039.GE3408@twins.programming.kicks-ass.net> References: <1459007812.5648.5.camel@petros-ultrathin> <1459059948.3799.14.camel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1459059948.3799.14.camel@gmail.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4726 Lines: 116 On Sun, Mar 27, 2016 at 08:25:48AM +0200, Mike Galbraith wrote: > (futex ordering pop-flare) > > On Sat, 2016-03-26 at 10:56 -0500, Petros Koutoupis wrote: > > I stumbled on an interesting scenario which I am unable to fully explain and I > > was hoping to get some other opinions on why this would or wouldn't work. > > > > In recent testing on a 48-core Haswell arch server, our multi-threaded user space > > application was utilizing 60% to 100% more CPU than on our smaller 24-core servers > > (running an identical load). After spending a considerable amount of time analyzing > > stack dumps and straces it became immediately apparent that those exact threads > > operating with the higher CPU utilization were off in futex land. perf should be able to tell you that pretty quickly, no? A question about your workload; are you typically 'stuck' in futex_wait() or some of the requeue stuff (which is more complex). Let me assume regular futex_wait() for now. > > Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0 > > (futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the > > handling of private futexes and preventing a race condition by completing the > > function with a memory barrier. Now, I fully understand why this patch was implemented: > > to have a memory barrier before checking the "waiters." It makes sense. Right; well, a barrier isn't _before_ something, its between two things. And I think we're ordering the user-space store to the futex value against the waiters load, but the code could do with a comment here. And this should be totally irrelevant for x86 because a CPL change already implies a full barrier. > > What doesn't > > make sense (so far) is when I apply the same patch to the drop counterpart, > > drop_futex_key_refs(), and the problem goes away. See the change and my notes below. > > > > > > --- linux/kernel/futex.c.orig 2016-03-25 19:45:08.169563263 -0500 > > +++ linux/kernel/futex.c 2016-03-25 19:46:06.901562211 -0500 > > @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu > > > > switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { > > case FUT_OFF_INODE: > > - iput(key->shared.inode); > > + iput(key->shared.inode); /* implies smp_mb(); (B) */ > > break; > > case FUT_OFF_MMSHARED: > > - mmdrop(key->private.mm); > > + mmdrop(key->private.mm); /* implies smp_mb(); (B) */ > > break; > > + default: > > + smp_mb(); /* explicit smp_mb(); (B) */ > > } > > } So the patch makes sense from a symmetry POV, but I'm having a hard time explaining why this would make any difference to your workload. The best I can come up with is that the explicit MFENCE (as opposed to a LOCK prefixed instruction) causes a store-buffer flush and hence the waker more often sees !waiters and therefore tries to acquire the bucket lock less. But this is not a correctness (nor ordering) issue; but purely an architectural side-effect. Furthermore; some proposed changes: http://marc.info/?l=linux-kernel&m=145400059704564&w=2 might change this side-effect. In any case; the below (completely irrelevant patch for you) is something I would propose. It gives hb_waiter_dec() RELEASE like semantics and ensures it cannot creep into the lock sections its typically behind. Although strictly speaking I think it being inside that lock region is sufficient. It also re-orders the increment in requeue to happen before we add to the list (as is the proper order) and removes a totally bogus comment; spin_lock() does _NOT_ imply a full barrier. --- kernel/futex.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 5d6ce6413ef1..615f277f313b 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -360,6 +360,7 @@ static inline void hb_waiters_inc(struct futex_hash_bucket *hb) static inline void hb_waiters_dec(struct futex_hash_bucket *hb) { #ifdef CONFIG_SMP + smp_mb__before_atomic(); atomic_dec(&hb->waiters); #endif } @@ -1442,8 +1443,8 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1, if (likely(&hb1->chain != &hb2->chain)) { plist_del(&q->list, &hb1->chain); hb_waiters_dec(hb1); - plist_add(&q->list, &hb2->chain); hb_waiters_inc(hb2); + plist_add(&q->list, &hb2->chain); q->lock_ptr = &hb2->lock; } get_futex_key_refs(key2); @@ -1864,7 +1865,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q) q->lock_ptr = &hb->lock; - spin_lock(&hb->lock); /* implies MB (A) */ + spin_lock(&hb->lock); return hb; }