Date: Mon, 11 Apr 2016 21:55:45 +0000 (UTC)
From: Mathieu Desnoyers
To: Andy Lutomirski
Cc: Peter Zijlstra, "Paul E. McKenney", Ingo Molnar, Paul Turner, Andi Kleen,
    Chris Lameter, Dave Watson, Josh Triplett, linux-api,
    linux-kernel@vger.kernel.org, Andrew Hunter, Linus Torvalds
Subject: Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections

----- On Apr 7, 2016, at 12:43 PM, Andy Lutomirski luto@amacapital.net wrote:

> On Thu, Apr 7, 2016 at 8:53 AM, Peter Zijlstra wrote:
>> On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
>>> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra wrote:
>>> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>>> >> What I meant was: rather than shoving individual values into the TLABI
>>> >> thing, shove in a pointer:
>>> >>
>>> >>   struct commit_info {
>>> >>           u64 post_commit_rip;
>>> >>           u32 cpu;
>>> >>           u64 *event;
>>> >>           // whatever else;
>>> >>   };
>>> >>
>>> >> and then put a commit_info* in TLABI.
>>> >>
>>> >> This would save some bytes in the TLABI structure.
>>> >
>>> > But would cost us extra indirections. The whole point was getting this
>>> > stuff at a constant offset from the TLS segment register.
>>>
>>> I don't think the extra indirections would matter much. The kernel
>>> would have to chase the pointer, but only in the very rare case where
>>> it resumes userspace during a commit or on the immediately following
>>> instruction.
>>
>> It's about userspace finding these values, not the kernel.
>
> But userspace never reads post_commit_rip or the abort address AFAIK.
>
>>
>>> At the very least, post_commit_rip and the abort address (which I
>>> forgot about) could both live in a static structure,
>>
>> Paul keeps the abort address in rcx.
>
> So it's probably a wash. Paul's way is to load the abort address in
> rcx (one instruction, no memory access) and to put post_commit_rip in
> TLABI (one store). Mine is to put a pointer to a descriptor into
> TLABI (one store). Should barely matter.
>
>>
>>> and shoving a
>>> pointer to *that* into TLABI space is one store instead of two.
>>
>>> > Ah, so what happens if the signal happens before the commit but after
>>> > the load of the seqcount?
>>> >
>>> > Then, even if the signal modifies the count, we'll not observe.
>>> >
>>>
>>> Where exactly?
>>>
>>> In my scheme, nothing except the kernel ever loads the seqcount. The
>>> user code generates a fresh value, writes it to memory, and then, just
>>> before commit, writes that same value to the TLABI area and then
>>> double-checks that the value it wrote at the beginning is still there.
>>>
>>> If the signal modifies the count, then the user code won't directly
>>> notice, but prepare_exit_to_usermode on the way out of the signal will
>>> notice that the (restored) TLABI state doesn't match the counter that
>>> the signal handler changed and will jump to the abort address.
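To check that I'm reading this correctly, here is a rough C rendering of
that flow. Every name below (the structure layout, __rseq_tlabi,
try_commit) is made up for illustration, and the real sequence clearly
has to stay in asm, since the compiler is free to reorder or elide these
stores; this only sketches the state the kernel would validate in
prepare_exit_to_usermode:

#include <stdint.h>

struct rseq_tlabi {
        uintptr_t post_commit_ip;       /* nonzero while a commit is in flight */
        uintptr_t abort_ip;
        uint64_t *seqcount;             /* address the kernel re-checks on resume */
        uint64_t seqcount_expected;     /* value userspace expects to still find there */
};

static __thread struct rseq_tlabi __rseq_tlabi; /* assumed TLS area */

static int try_commit(uint64_t *seqcount, uint64_t fresh,
                      uint64_t *target, uint64_t to_write)
{
        /* Publish what the kernel needs to validate on resume/signal exit. */
        __rseq_tlabi.abort_ip = (uintptr_t)&&abort_cs;  /* GCC label-address extension */
        __rseq_tlabi.seqcount = seqcount;
        __rseq_tlabi.seqcount_expected = fresh;
        __rseq_tlabi.post_commit_ip = (uintptr_t)&&post_commit;

        /* Double-check that the fresh value written on entry is still there. */
        if (*seqcount != fresh)
                goto abort_cs;
        *target = to_write;             /* the commit store */
post_commit:
        __rseq_tlabi.post_commit_ip = 0;
        return 0;
abort_cs:
        __rseq_tlabi.post_commit_ip = 0;
        return 1;
}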
>>
>>
>> OK, you lost me.. commit looks like:
>>
>> +       __asm__ __volatile__ goto (
>> +               "movq $%l[failed], %%rcx\n"
>> +               "movq $1f, %[commit_instr]\n"
>> +               "cmpq %[start_value], %[current_value]\n"
>>
>> If we get preempted/signaled here without the preemption/signal entry
>> checking for the post_commit_instr, we'll fail hard.
>
> Not if I'm thinking about this right. Because, in my scheme, we'd be
> storing both the address of the counter [1] and the value that we
> expect it to be (current_value? something's confused in the asm or
> the variable naming, or I'm just not following exactly which variable
> is which) in registers or TLABI, and the kernel would notice that
> *counter != expected_counter when it resumes and would send us to the
> abort label.
>
> Presumably there would be one register hard-coded as the expected
> counter value and another register hard-coded as the address of the
> counter. That would avoid a bunch of stores to TLABI, and the kernel
> could still find them. If the C optimizer did a good job, the
> resulting code would be simple.
>
> [1] In my scheme, there would be no reason for the counter to live in
> TLABI. In contrast, there would be an excellent reason for it to
> *not* live in TLABI -- user code could have more than one set of
> counters, and they wouldn't interfere with each other. This would be
> nice if one of them protects something that uses long enough code in
> the critical sections that it's likely to be preempted.
>
>>
>> +               "jnz %l[failed]\n"
>> +               "movq %[to_write], (%[target])\n"
>> +               "1: movq $0, %[commit_instr]\n"
>> +               : /* no outputs */
>> +               : [start_value]"d"(start_value.storage),
>> +                 [current_value]"m"(__rseq_state),
>> +                 [to_write]"r"(to_write),
>> +                 [target]"r"(p),
>> +                 [commit_instr]"m"(__rseq_state.post_commit_instr)
>> +               : "rcx", "memory"
>> +               : failed
>> +               );
>
>
> More concretely, this looks like (using totally arbitrary register
> assignments -- probably far from ideal, especially given how GCC's
> constraints work):
>

I think you have an ABA-style issue here: if you are migrated to another
CPU at (A), and back to the original CPU at (B), your CPU check is not
doing much, and you may be corrupting (%%r13) of a concurrently running
thread.

> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13

(A) -> preempted, migrated to remote CPU.

> movq {some fresh value}, (%%r13)

(B) -> migrated back to original CPU.

> cmpq %[cpu], %%r12
> jne 1b

My understanding is that you should only ever _read_ from such a per-cpu
data structure upon entry into the critical section, not write, because
you have no way to know whether you are migrated to another CPU at that
point.

Those per-cpu data ideas are interesting though. I've come up with the
sketch of a scheme that fetches a seqcount from the percpu data, and
keeps a copy in the TLS area. Whenever the kernel preempts a thread, it
increments both the seqcount in the currently used percpu data _and_ the
seqcount in the TLS area. Then checking if we need to restart the
critical section is a matter of checking that the two counters match,
else it means the percpu data counter has been touched by some other
thread.
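To make the intent more concrete, here is a very rough C sketch of the
userspace side. All names below (the structures, __rseq_tls, the
cs_enter/cs_need_restart helpers) are hypothetical, and the kernel side
that bumps both counters when it preempts the thread is not shown:

#include <stdbool.h>
#include <stdint.h>

struct percpu_data {
        uint64_t seqcount;      /* bumped by the kernel when it preempts a user of this cpu's data */
        /* ... the per-cpu data actually being protected ... */
};

struct rseq_tls_area {
        uint32_t cpu_id;        /* maintained by the kernel */
        uint64_t seqcount_copy; /* bumped together with the percpu seqcount */
};

static __thread struct rseq_tls_area __rseq_tls;        /* assumed TLS area */

/* Entry: only _read_ the percpu seqcount, and snapshot it into the TLS area. */
static inline struct percpu_data *cs_enter(struct percpu_data *percpu_array)
{
        struct percpu_data *d = &percpu_array[__rseq_tls.cpu_id];

        __rseq_tls.seqcount_copy = d->seqcount;
        return d;
}

/*
 * Restart check: if another thread was preempted while using this cpu's
 * data, the kernel bumped the percpu seqcount but not our TLS copy, so
 * the two counters no longer match.
 */
static inline bool cs_need_restart(const struct percpu_data *d)
{
        return d->seqcount != __rseq_tls.seqcount_copy;
}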
I'll try to come up with an implementation of this idea, because there
are a few more tricks to it to ensure all migration races are dealt
with.

One thing I doubt I'll be able to handle would be having nested
restartable critical sections (e.g. one c.s. nested in another, but not
through a nested signal handler). Not sure if we really care about this
use-case though.

Thanks,

Mathieu

>
> ... do whatever setup or computation is needed...
>
> movq $%l[failed], %%rcx
> movq $1f, %[commit_instr]
> cmpq {whatever counter we chose}, (%%r13)
> jne %l[failed]
> cmpq %[cpu], %%r12
> jne %l[failed]
>
> <-- a signal in here that conflicts with us would clobber (%%r13), and
> the kernel would notice and send us to the failed label
>
> movq %[to_write], (%[target])
> 1: movq $0, %[commit_instr]
>
> In contrast to Paul's scheme, this has two additional (highly
> predictable) branches and requires generation of a seqcount in
> userspace. In its favor, though, it doesn't need preemption hooks,
> it's inherently debuggable, and it allows multiple independent
> rseq-protected things to coexist without forcing each other to abort.
>
> The first cmpq might not be necessary. I put it there to ensure that
> "some fresh value" is visible on the chosen CPU before any other loads
> happen, but it's possible that the other cpu double-check is enough.
>
>
> (Off topic: some of those movq instructions should probably be
> rip-relative leaq instead.)
>
> If my descriptor approach were used, we'd save the write to rcx.
> Instead we'd do:
>
> .pushsection .rodata
> 2:
> .quad %l[failed] - .  /* or similar */
> .quad 1b - .
> .popsection
> leaq 2b(%rip), %[tlabi_rseq_desc]
>
> That's the optimization I was talking about above, although I did a
> bad job of describing it.
>
> --Andy

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com