Date: Thu, 7 Apr 2016 14:02:54 +0200
From: Peter Zijlstra
To: Andy Lutomirski
Cc: Paul Turner, Mathieu Desnoyers, "Paul E. McKenney", Ingo Molnar,
    "linux-kernel@vger.kernel.org", Chris Lameter, Andi Kleen,
    Josh Triplett, Dave Watson, Andrew Hunter, Linus Torvalds, Linux API
Subject: Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
Message-ID: <20160407120254.GY3448@twins.programming.kicks-ass.net>
References: <20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com>

On Wed, Apr 06, 2016 at 08:56:33AM -0700, Andy Lutomirski wrote:

> What are the useful commit operations?  IIUC they probably need to be
> single instructions, which makes it sound like they're all either RMW
> or store operations.  I think that plain stores are sufficient to
> emulate RMW since (1) if the value changes out from under you, you'll
> abort, and (2) most CPUs don't have single instruction RMW anyway.

Yeah so stores are sufficient. The only requirement is that the store is
a single operation/instruction. So you can typically not commit anything
wider than the native word size.

> We could probably speed up commits and make the code a bit more
> obvious by feeding the kernel a pointer to a descriptor instead of
> feeding it individual values.
> For example, the descriptor could be:

See the Thread-Local-ABI effort by Mathieu, the idea is to get a single
full cacheline (64 bytes) fixed size thingy allocated at a fixed offset
to the TCB.

That way we can reduce to %[gf]s:offset for these here variables (I
forever forget which segment register userspace uses for TLS).

> Is your scheme safe against signals that are delivered during commit?
> Would the signal context need to save and zero the commit state?

The patches you comment on here explicitly update the event from the
signal frame setup and thereby handle this.

The update not only increments the sequence count, but also tests the
post_commit_ip thing, if set it assigns fail value to regs->ip (ie the
commit fails).

tl;dr, yes it's safe against signals happening during commit.

> I still want to see if we can get away from the kernel-managed event
> counter.  Would the following work:
>
> start_sequence:
>   read current CPU number
>   change event counter
>   re-read current CPU number [1]
>
> commit:
>   tell kernel we're committing
>   re-check event counter and CPU number
>   do the commit instruction
>   tell kernel we're done committing
>
> [1] avoids a memory ordering issue if we migrate before changing the
> event counter

So currently the event and cpu form a single 64bit value, so that on
64bit we can use a single load and cmp to verify them.

So letting the user manage the event is doable, but it would still be
advisable to have the event in the same shared word.

> The kernel forces an abort if, on resume from any kernel entry, the
> CPU number or event counter is wrong.
>
> If this worked, then it would be inherently debuggable, since the only
> way that it would abort in a single-threaded situation is if it
> migrated during commit.

(or a signal happened, as per the below)

Tempting..
not signal safe though, although I suppose we can still explicitly do
the:

	if (regs->ip < post_commit_ip)
		regs->ip = regs->rcx;

thing on signal frame setup to abort any in-flight commit without
explicitly incrementing the sequence number.

Not having to manage the event count from kernel reduces the kernel work
to migration only -- ie. we can get rid of the preemption hook and
reduce overhead there, something I'm entirely happy with if possible.
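
For readers following along, here is a rough user-space sketch of the
two mechanisms discussed above: the packed cpu/event word that can be
verified with one 64bit load and compare, and the signal frame fixup
that redirects an in-flight commit to its abort address. This is not
code from the patches. The names (pack_cpu_event, still_valid,
fake_regs, signal_frame_fixup, run_example) and the choice of putting
the event in the high 32 bits are assumptions made purely for
illustration, and the "if set" test on post_commit_ip follows the
description above rather than any actual kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved user registers (not struct pt_regs). */
struct fake_regs {
	uint64_t ip;	/* interrupted instruction pointer */
	uint64_t rcx;	/* abort address, as in the regs->rcx example */
};

/*
 * Assumed layout: cpu in the low 32 bits, event count in the high 32 bits.
 * The real layout may differ; only "both in one word" matters here.
 */
static uint64_t pack_cpu_event(uint32_t cpu, uint32_t event)
{
	return ((uint64_t)event << 32) | cpu;
}

/* One 64bit load and one compare check the cpu and the event together. */
static int still_valid(uint64_t snapshot, const volatile uint64_t *shared)
{
	return *shared == snapshot;
}

/*
 * Signal frame setup: if a commit is in flight (post_commit_ip set and the
 * interrupted ip is still before it), make the commit fail by resuming at
 * the abort address instead.
 */
static void signal_frame_fixup(struct fake_regs *regs, uint64_t post_commit_ip)
{
	if (post_commit_ip && regs->ip < post_commit_ip)
		regs->ip = regs->rcx;
}

/* Returns 0 if every step behaves as described, non-zero otherwise. */
int run_example(void)
{
	uint64_t shared = pack_cpu_event(3, 41);
	uint64_t snap = shared;
	struct fake_regs regs = { .ip = 0x1000, .rcx = 0x2000 };

	if (!still_valid(snap, &shared))
		return 1;

	shared = pack_cpu_event(3, 42);		/* event count changed */
	if (still_valid(snap, &shared))
		return 2;

	shared = pack_cpu_event(4, 41);		/* migrated to another cpu */
	if (still_valid(snap, &shared))
		return 3;

	signal_frame_fixup(&regs, 0x1010);	/* interrupted before commit */
	if (regs.ip != 0x2000)
		return 4;

	return 0;
}
```

Calling run_example() should return 0. The point of keeping both fields
in one word is that the fast path needs only a single compare before
the commit store, whoever ends up managing the event count.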