Date: Tue, 22 Jul 2014 03:01:00 +0200 (CEST)
From: Thomas Gleixner
To: Darren Hart
Cc: Andy Lutomirski, Peter Zijlstra, Andi Kleen, Waiman Long, Ingo Molnar,
    Davidlohr Bueso, Heiko Carstens, linux-kernel@vger.kernel.org, Linux API,
    linux-doc@vger.kernel.org, Jason Low, Scott J Norton, Steven Rostedt
Subject: Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex

On Mon, 21 Jul 2014, Darren Hart wrote:
> On 7/21/14, 14:47, "Thomas Gleixner" wrote:
>
> >On Mon, 21 Jul 2014, Andy Lutomirski wrote:
> >> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra wrote:
> >> > All this is predicated on the fact that syscalls are 'expensive'.
> >> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> >> > more expensive due to cacheline misses, which due to the size of the
> >> > things is almost guaranteed.
> >>
> >> 120 - 300 cycles for me, unless tracing happens, and I'm working on
> >> reducing the incidence of tracing.
> >
> >So it's a non-issue indeed, and definitely not worth the trouble of
> >that extra storage, the scheduler overhead, etc.
> >
> >Summary: Once you can't take it atomically in user space, you've lost
> >         anyway. And we are better off doing the magic spinning in the
> >         kernel, where we have all the information accessible already.
>
> And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
> discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
> your and others' comments:
>
> http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock/v7
>
> I can work on forward-porting this series to current mainline (post recent
> security fixes) and cleaning up the commentary and such, if people are
> interested in seeing this implementation (based on Peter Z's spinning
> mutex work, iirc) resurrected...

No. We really want to avoid more magic hackery in the futex code.

Let's sit down and think about what we need:

1) Support for all existing features

2) More fine-grained locking

3) Optimistic spinning

@1: No discussion about that. Period.

    We are not going to introduce some new futex opcode which is
    optimized for a microbenchmark and ignores all of the pain we went
    through in the last 10 years. No way, especially after the recent
    security disaster.

@2 and @3: Yes, we want that.

    But again, we don't add fine-grained locking just for some
    half-baked new opcode. No, we are adding it where it makes the most
    sense and lets us reuse most of the code.

I can predict your question of how that should work. :)

If we want to have fine-grained locking beyond the futex hash buckets,
and that's something we discussed before, you need a state
representation of the user space futex in the kernel. That's what
Waiman added as well, just in a way that's beyond repair.

The charm of the futex hash buckets is that they do not create kernel
state, because the futex_q which is enqueued into the bucket is on the
stack of the task which is blocked on the futex.

So much for the theory, because that stopped being true when we added
support for PI futexes.
PI futexes already create kernel-internal state which is not magically
removed when the thread exits the syscall. It's called pi_state.

Now, how is this related to the non-PI magic spinning optimization?
Very much so. The handling of pi_state already covers ALL the corner
cases which come with the extra kernel-internal state, and it provides
the full function set of futexes required by the various (ab)use cases,
including the requeue functionality which is required for condvars. It
even supports caching of the pi_state struct in a way which makes
sense, rather than blindly slapping a cached entry onto each hash
bucket.

So it's not a really mind-boggling thought experiment to think about
this magic new feature as a subset of the already existing PI futex
code. It is a subset: just minus the PI portion, plus the more
fine-grained locking.

So the right solution is to rework the existing code:

1) Make pi_state the entity which gets enqueued into the hash bucket
   and manage the waiters in a pi_state-internal queue.

   A straightforward problem, as it is still protected by the hash
   bucket lock.

2) Add private locking to the pi_state to reduce the lock contention
   on the hash bucket lock.

   A straightforward conversion as well.

3) Make mutex a subset of rt_mutex.

   That makes a lot of sense, simply because a mutex is the same as an
   rtmutex, just without the PI part.

   Just for the record, before anyone cries murder: the original mutex
   implementation was modeled after the rtmutex one in the first place,
   and optimistic spinning was first introduced on rtmutexes by Gregory
   Haskins et al. in the context of the RT tree. Back then it was too
   early, or we were simply too stupid, to see the similarities and
   opted for a completely separate implementation.

   Yes, I'm aware that this is a non-trivial problem, but if you look
   at the details, the non-trivial part is in the slow path, where
   stuff actually goes to sleep. The fast path of mutex and rt_mutex is
   simply the same.
   So there is no reason to keep them separate. An extra conditional
   in the slow path is not going to hurt at all.

   Surely, you might say that I'm an egoistic bastard who just wants
   to have the benefit of MCS for rtmutex for free. Right you are, but
   there is a whole bunch of reasons why this makes tons of sense:

   - As I said above, it's almost the same, except for the slow path
     details, where a few extra conditionals do not matter.

   - Sharing the code has the obvious benefits. And it's not only
     sharing between rtmutex and mutex; it's also sharing the
     underlying optimizations of MCS between the kernel-internal users
     and the user-space-exposed ones, i.e. futexes.

   - Once unified, the desired futex functionality just works, with a
     minimal impact on the futex code itself. The futex code does not
     care whether the underlying primitive has PI semantics or not; it
     just cares that all the required features like proxy locking etc.
     are in place, and that everything like timeouts, requeue etc. is
     supported.

   - The semantics of the new futex functionality do not have to be
     defined again. We can simply reuse the rather well-defined and
     well-tested straightforward semantics of the existing PI opcodes.

   - That minimizes the risk massively, because all state and error
     handling is common code.

   - Any update to the underlying primitives will benefit ALL usage
     sites and reduce the copy-and-paste-induced hell of MCS code
     sprinkled all over the kernel.

4) Enable user space to make use of it.

   Whether this will be a separate opcode or just a new flag to the
   existing PI opcodes is going to be just a bikeshedding question at
   that point. :)

The important point is that we do NOT grow any new unmaintainable
warts on the futex code. And the 650+ new lines of hackery which come
with the proposed patch set are in no way acceptable, especially as
they only cover a minimalistic subset of futex functionality. Not to
talk about the other issues I've observed at first glance.
So before anyone comes up with a "solution" for all of this tomorrow
afternoon in the form of another half-baked patch, please sit back,
mull it over, and let's have a proper discussion about the approach
first.

Thanks,

        tglx