Date: Sun, 3 Apr 2016 13:16:28 +0200
From: Ingo Molnar
To: Thomas Gleixner, Linus Torvalds, Andrew Morton, Peter Zijlstra
Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra, Michael Kerrisk, Davidlohr Bueso, Chris Mason, "Carlos O'Donell", Torvald Riegel, Eric Dumazet
Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes
Message-ID: <20160403111628.GA16916@gmail.com>
References: <20160402095108.894519835@linutronix.de> <20160402110035.753145539@linutronix.de>
In-Reply-To: <20160402110035.753145539@linutronix.de>

* Thomas Gleixner wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.
>
> To utilize this mechanism each thread has to attach to the futex before any
> other operations on that futex.
>
> The inner workings are as follows:
>
>   Attach:
>
>     sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);
>
>     If this is the first attach to uaddr then a 'global state' object is
>     created.
>     This global state contains a futex hash bucket and a futex_q
>     object which is enqueued into the global hash for reference so subsequent
>     attachers can find it. Each attacher takes a reference count on the
>     'global state' object and hashes 'uaddr' into a thread local hash. This
>     thread local hash is lock free and dynamically expanded to avoid
>     collisions. Each populated entry in the thread local hash stores 'uaddr'
>     and a pointer to the 'global state' object.
>
>   Futex ops:
>
>     sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);
>
>     If the attached flag is set, then 'uaddr' is hashed and the thread local
>     hash is checked whether the hash entry contains 'uaddr'. If no, an error
>     code is returned. If yes, the hash slot number is stored in the futex key
>     which is used for further operations on the futex. When the hash bucket is
>     looked up then attached futexes will use the slot number to retrieve the
>     pointer to the 'global state' object and use the embedded hash bucket for
>     the operation. Non-attached futexes just use the global hash as before.
>
>   Detach:
>
>     sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);
>
>     Detach removes the entry in the thread local hash and decrements the
>     refcount on the 'global state' object. Once the refcount drops to zero the
>     'global state' object is removed from the global hash and destroyed.
>
>     Thread exit cleans up the thread local hash and the 'global state' objects
>     as we do for other futex related storage already.
>
> The thread local hash and the 'global state' object are allocated on the node
> on which the attaching thread runs.
>
> Attached mode works with all futex operations and with both private and shared
> futexes. For operations which involve two futexes, i.e. FUTEX_REQUEUE_* both
> futexes have to be either attached or detached (like FUTEX_PRIVATE).
>
> Why not auto attaching?
>
> Auto attaching has the following problems:
>
>  - Memory consumption
>  - Life time issues
>  - Performance issues due to the necessary allocations

But those are mostly setup only costs, right? So I don't think this conclusion
is necessarily true, even on smaller systems:

> So, no. It must be opt-in and reserved for explicit isolation purposes.
>
> A modified version of 'perf bench futex hash' shows the following results:

and look at the very measurable performance advantages on a small NUMA system:

Before:

>   Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

After:

>   Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

> That's a performance increase of 18%.

... and I suspect that on a larger NUMA system the speedup is probably a lot
more pronounced.

Also, the thing is, allocation/deallocation costs are a second order concern
IMHO, because most of the futex's usage is the lock/unlock operations.

So my prediction: in real life large systems will want to have collision-free
futexes most of the time, and they don't want to modify every futex-using
application or library. So this is a mostly kernel side system sizing
question/decision, not really a user-side system purpose policy question.

So an ABI distinction and offloading the decision to every single application
that wants to use it and hardcode it into actual application source code via an
ABI is pretty much the _WORST_ way to go about it IMHO...

So how about this: don't add any ABI details, but make futexes auto-attached on
NUMA systems (and obviously PREEMPT_RT systems)?

I.e. make it a build time or boot time decision at most, don't start a messy
'should we use attached futexes or not' decision on the ABI side, which we know
from Linux ABI history won't be answered and utilized very well by applications!

Thanks,

	Ingo
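
For reference, the attach / lookup / detach life cycle described in the quoted
patch text can be modeled in user space roughly as follows. This is only an
illustrative sketch, not code from the patch: every name in it
(futex_attach_model(), struct global_state, LOCAL_SLOTS, ...) is hypothetical,
the "global hash" is reduced to a linked list, the thread local hash has a
fixed size instead of being dynamically expanded, and there is no locking or
lock-free handling at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LOCAL_SLOTS 64  /* assumption: fixed size, the patch expands dynamically */

/* Hypothetical stand-in for the 'global state' object. The real object also
 * embeds a futex hash bucket and a futex_q; only the refcount is modeled. */
struct global_state {
	const void *uaddr;
	int refcount;
	struct global_state *next;
};

/* One populated entry stores 'uaddr' and a pointer to the global state. */
struct local_entry {
	const void *uaddr;
	struct global_state *gs;
};

struct thread_hash {
	struct local_entry slot[LOCAL_SLOTS];
};

/* Stand-in for the kernel's global hash: a plain singly linked list. */
static struct global_state *global_list;

static struct global_state *global_find(const void *uaddr)
{
	for (struct global_state *gs = global_list; gs; gs = gs->next)
		if (gs->uaddr == uaddr)
			return gs;
	return NULL;
}

static size_t hash_uaddr(const void *uaddr)
{
	return ((uintptr_t)uaddr >> 2) % LOCAL_SLOTS;
}

/* Futex op path: return the slot number for 'uaddr', or -1 if not attached. */
static int futex_lookup_model(const struct thread_hash *th, const void *uaddr)
{
	if (!uaddr)
		return -1;
	size_t h = hash_uaddr(uaddr);
	for (size_t i = 0; i < LOCAL_SLOTS; i++) {
		size_t idx = (h + i) % LOCAL_SLOTS;
		if (th->slot[idx].uaddr == uaddr)
			return (int)idx;
	}
	return -1;
}

/* Attach: find or create the global state, take a reference, fill a slot. */
static int futex_attach_model(struct thread_hash *th, const void *uaddr)
{
	if (!uaddr)
		return -1;
	int found = futex_lookup_model(th, uaddr);
	if (found >= 0)
		return found;  /* already attached, no extra reference taken */

	size_t h = hash_uaddr(uaddr);
	for (size_t i = 0; i < LOCAL_SLOTS; i++) {
		size_t idx = (h + i) % LOCAL_SLOTS;
		if (th->slot[idx].uaddr)
			continue;  /* slot in use, probe the next one */
		struct global_state *gs = global_find(uaddr);
		if (!gs) {  /* first attacher creates the global state */
			gs = calloc(1, sizeof(*gs));
			if (!gs)
				return -1;
			gs->uaddr = uaddr;
			gs->next = global_list;
			global_list = gs;
		}
		gs->refcount++;
		th->slot[idx].uaddr = uaddr;
		th->slot[idx].gs = gs;
		return (int)idx;
	}
	return -1;  /* table full in this fixed-size model */
}

/* Detach: clear the slot, drop the reference, destroy on last put. */
static int futex_detach_model(struct thread_hash *th, const void *uaddr)
{
	int slot = futex_lookup_model(th, uaddr);
	if (slot < 0)
		return -1;
	struct global_state *gs = th->slot[slot].gs;
	th->slot[slot].uaddr = NULL;
	th->slot[slot].gs = NULL;
	assert(gs->refcount > 0);
	if (--gs->refcount == 0) {
		struct global_state **pp = &global_list;
		while (*pp != gs)
			pp = &(*pp)->next;
		*pp = gs->next;  /* unlink from the 'global hash' */
		free(gs);
	}
	return 0;
}
```

In this model two threads that attach to the same uaddr share one global state
object with a refcount of 2, an operation on a non-attached uaddr fails the
lookup, and the object goes away when the last attacher detaches. The model
says nothing about the allocation and lifetime costs discussed above, which is
where the opt-in versus auto-attach question is actually decided.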