In-Reply-To: <20160407201156.GC3448@twins.programming.kicks-ass.net>
References: <20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com>
 <20160407120254.GY3448@twins.programming.kicks-ass.net>
 <20160407152432.GZ3448@twins.programming.kicks-ass.net>
 <20160407155312.GA3448@twins.programming.kicks-ass.net>
 <20160407201156.GC3448@twins.programming.kicks-ass.net>
From: Andy Lutomirski
Date: Thu, 7 Apr 2016 15:05:26 -0700
Subject: Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections
To: Peter Zijlstra
Cc: Mathieu Desnoyers, "Paul E. McKenney", Ingo Molnar, Paul Turner,
 Andi Kleen, Chris Lameter, Dave Watson, Josh Triplett, Linux API,
 "linux-kernel@vger.kernel.org", Andrew Hunter, Linus Torvalds

On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>> More concretely, this looks like (using totally arbitrary register
>> assignments -- probably far from ideal, especially given how GCC's
>> constraints work):
>>
>> enter the critical section:
>> 1:
>>         movq %[cpu], %%r12
>>         movq {address of counter for our cpu}, %%r13
>>         movq {some fresh value}, (%%r13)
>>         cmpq %[cpu], %%r12
>>         jne 1b
>>
>> ... do whatever setup or computation is needed...
>>
>>         movq $%l[failed], %%rcx
>>         movq $1f, %[commit_instr]
>>         cmpq {whatever counter we chose}, (%%r13)
>>         jne %l[failed]
>>         cmpq %[cpu], %%r12
>>         jne %l[failed]
>>
>> <-- a signal in here that conflicts with us would clobber (%%r13), and
>> the kernel would notice and send us to the failed label
>>
>>         movq %[to_write], (%[target])
>> 1:      movq $0, %[commit_instr]
>
> And the kernel, for every thread that has had the syscall called and a
> thingy registered, needs to (at preempt/signal-setup):
>
>         if (get_user(post_commit_ip, current->post_commit_ip))
>                 return -EFAULT;
>
>         if (likely(!post_commit_ip))
>                 return 0;
>
>         if (regs->ip >= post_commit_ip)
>                 return 0;
>
>         if (get_user(seq, (u32 __user *)regs->r13))
>                 return -EFAULT;
>
>         if (regs->$(which one holds our chosen seq?) == seq) {
>                 /* nothing changed, do not cancel, proceed to commit. */
>                 return 0;

Only return zero if regs->$(which one holds the cpu) == smp_processor_id().

>         }
>
>         if (put_user(0UL, current->post_commit_ip))
>                 return -EFAULT;
>
>         regs->ip = regs->rcx;

I was imagining this happening at (return to userspace or preempt) and
possibly at signal return, but yes, more or less.

>
>
>> In contrast to Paul's scheme, this has two additional (highly
>> predictable) branches and requires generation of a seqcount in
>> userspace.  In its favor, though, it doesn't need preemption hooks,
>
> Without preemption hooks, how would one thread preempting another at the
> above <-- clobber anything and cause the commit to fail?

It doesn't, which is what I like about my variant.  If the thread
accesses the protected data structure, though, it should bump the
sequence count, which will cause the first thread to abort when it
gets scheduled in.

>
>> it's inherently debuggable,
>
> It is more debuggable, agreed.
>
>> and it allows multiple independent
>> rseq-protected things to coexist without forcing each other to abort.
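
As a rough illustration of the kernel-side check quoted above, with the
CPU comparison folded in, here is a userspace C model.  It is only a
sketch: struct model_regs, struct model_thread and rseq_model_fixup()
are made-up names, the fields merely mirror the arbitrary register
choices in the example, and the get_user()/put_user() fault handling
is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved user registers. */
struct model_regs {
	uintptr_t ip;       /* interrupted instruction pointer */
	uintptr_t fail_ip;  /* %rcx in the example: address of the failed label */
	uint32_t *seq_ptr;  /* %r13 in the example: this CPU's sequence counter */
	uint32_t seq;       /* register holding the sequence value we chose */
	int cpu;            /* %r12 in the example: CPU number read on entry */
};

/* Hypothetical stand-in for the registered per-thread area. */
struct model_thread {
	uintptr_t post_commit_ip;  /* 0 when not inside a finish block */
};

/*
 * Returns 1 if the thread was redirected to its failure path,
 * 0 if it is left alone.
 */
static int rseq_model_fixup(struct model_thread *t, struct model_regs *regs,
			    int current_cpu)
{
	if (t->post_commit_ip == 0)
		return 0;  /* not in a finish block */

	if (regs->ip >= t->post_commit_ip)
		return 0;  /* already past the commit instruction */

	/* Proceed to commit only if the counter is intact AND the CPU matches. */
	if (*regs->seq_ptr == regs->seq && regs->cpu == current_cpu)
		return 0;

	t->post_commit_ip = 0;
	regs->ip = regs->fail_ip;
	return 1;
}
```

A thread whose counter is untouched but which resumes on a different
CPU is sent to the failure path, which is the case the amendment to
the quoted check is meant to cover.
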
>
> And the kernel only needs to load the second cacheline if it lands in
> the middle of a finish block, which should be manageable overhead I
> suppose.
>
> But the userspace chunk is lots slower as it needs to always touch
> multiple lines, since the @cpu, @seq and @post_commit_ip all live in
> separate lines (although I suppose @cpu and @post_commit_ip could live
> in the same).
>
> The finish thing needs 3 registers for:
>
>  - fail ip
>  - seq pointer
>  - seq value
>
> Which I suppose is possible even on register constrained architectures
> like i386.

I think this can all be munged into two cachelines:

One cacheline contains the per-thread CPU number and post_commit_ip
(either by doing it over Linus' dead body or by having userspace
allocate it carefully).  The other contains the sequence counter *and*
the percpu data structure that's protected.  So in some sense it's the
same number of cache lines as Paul's version.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
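
P.S. To make the two-cacheline layout above concrete, a minimal C11
sketch follows.  Everything in it is an assumption for illustration:
the 64-byte line size, the struct and field names, and the single
uint64_t standing in for the protected percpu data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CACHELINE 64  /* assumption: 64-byte cache lines */

/* Line 1: per-thread CPU number and post_commit_ip share one line. */
struct model_thread_area {
	_Alignas(MODEL_CACHELINE) int32_t cpu;
	uintptr_t post_commit_ip;
};

/* Line 2: the sequence counter sits next to the data it protects. */
struct model_percpu_slot {
	_Alignas(MODEL_CACHELINE) uint32_t seq;
	uint64_t protected_value;  /* hypothetical payload */
};

/* Each structure occupies exactly one line under the assumption above. */
_Static_assert(sizeof(struct model_thread_area) == MODEL_CACHELINE,
	       "thread area must fit in one cache line");
_Static_assert(sizeof(struct model_percpu_slot) == MODEL_CACHELINE,
	       "percpu slot must fit in one cache line");
```

With this layout the fast path touches one line for the per-thread
state and one line for the counter plus its data.
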