Date: Thu, 14 Dec 2017 18:12:57 +0000 (UTC)
From: Mathieu Desnoyers
To: Chris Lameter
Cc: Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Andy Lutomirski,
    Dave Watson, linux-kernel, linux-api, Paul Turner, Andrew Morton,
    Russell King, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    Andrew Hunter, Andi Kleen, Ben Maurer, rostedt, Josh Triplett,
    Linus Torvalds, Catalin Marinas, Will Deacon, Michael Kerrisk,
    Alexander Viro
Message-ID: <12046460.34426.1513275177081.JavaMail.zimbra@efficios.com>
References: <20171214161403.30643-1-mathieu.desnoyers@efficios.com>
    <20171214161403.30643-3-mathieu.desnoyers@efficios.com>
Subject: Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable
    sequences system call (v12)

----- On Dec 14, 2017, at 11:44 AM, Chris Lameter cl@linux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
>
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data.
>> This approach performs similarly to the cpu id cache, but it has two
>> disadvantages: it is not portable, and it is incompatible with
>> existing applications already using the gs segment selector for
>> other purposes.
>
> I think the proper way to think about gs and fs on x86 is as base
> registers. They are essentially values in registers added to the address
> generated in an instruction. As such the approach is transferable to other
> processor architectures. Many support base registers and base register
> relative processing. If a processor can do RMW instructions base register
> relative then you have something similar.

How would you do it on ARM32?

>
> In a restartable sequence you could increase efficiency by avoiding full
> atomic instructions. This would be similar to the lockless RMW available
> on x86 then. And in that form it is portable.
>
> A context switch to another processor would mean that the value of the
> base register has changed and that we therefore are accessing another per
> cpu segment. Restarting the sequence will yield a correct result without
> any reloading of registers.

As a concrete example, let's try to apply your proposal to a common
use-case: a compare-and-store on user-space per-cpu data.

With my rseq proposal the fast-path pseudo-code boils down to:

  load TLS::cpu_id_start into reg_X
  add reg_X offset to base to find target v
  store pointer to TLS::rseq_cs
  compare reg_X against TLS::cpu_id
  jne abort
  cmp *v, value
  jne cmpfail
  store newval to *v

My benchmark on Intel x86-64 E5-2630 shows that it takes 1.9 ns/iteration
for a test-case incrementing a counter with this rseq compare-and-store
sequence.

Let's assume we can reserve the gs segment selector for use in user-space,
and that the per-cpu data layout allows using this segment selector as
an offset. The compare-and-store use-case would then require a "cmpxchg"
instruction with a gs segment selector.
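
For readers who prefer C to pseudo-code, the control flow of the rseq
compare-and-store fast path listed earlier can be modelled as below. This
is only a sketch under stated assumptions: every name in it (the struct,
the enum, the function) is hypothetical and is not part of the rseq ABI,
and plain C like this is not restartable. The real fast path has to be a
short assembly critical section that the kernel can abort on preemption or
migration.

```c
#include <stdint.h>

/* Hypothetical result codes, one per exit of the pseudo-code. */
enum cmpstore_result {
    CMPSTORE_OK = 0,      /* store newval to *v was performed */
    CMPSTORE_ABORT = 1,   /* "jne abort": cpu id changed */
    CMPSTORE_CMPFAIL = 2  /* "jne cmpfail": *v did not match */
};

/* Hypothetical stand-in for the per-thread TLS area. */
struct rseq_tls_model {
    uint32_t cpu_id_start; /* cpu number sampled when the sequence starts */
    uint32_t cpu_id;       /* cpu number the thread is running on */
    const void *rseq_cs;   /* descriptor of the active critical section */
};

/*
 * Model of the fast path. "base" is assumed to be an array holding one
 * slot per cpu, so the "offset" is simply the cpu number.
 */
static enum cmpstore_result
percpu_cmpstore_model(struct rseq_tls_model *tls, intptr_t *base,
                      intptr_t expect, intptr_t newval,
                      const void *cs_descriptor)
{
    uint32_t cpu = tls->cpu_id_start;  /* load TLS::cpu_id_start into reg_X */
    intptr_t *v = &base[cpu];          /* add reg_X offset to base */

    tls->rseq_cs = cs_descriptor;      /* store pointer to TLS::rseq_cs */
    if (cpu != tls->cpu_id)            /* compare reg_X against TLS::cpu_id */
        return CMPSTORE_ABORT;         /* jne abort */
    if (*v != expect)                  /* cmp *v, value */
        return CMPSTORE_CMPFAIL;       /* jne cmpfail */
    *v = newval;                       /* store newval to *v */
    return CMPSTORE_OK;
}
```

A caller incrementing a per-cpu counter would read the current value, call
the function with expect set to that value and newval set to expect + 1,
and retry on CMPSTORE_ABORT or CMPSTORE_CMPFAIL.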

With a single-threaded test-case that uses a non-lock-prefixed cmpxchg in
a loop on an E5-2630, I get 2.8 ns/iteration (no per-cpu data involved;
done on a single global value).

One benefit of your proposal is that it reduces the number of retired
instructions, but once the IPC is taken into account, it is slower than
rseq in my benchmark.

What benefits do you expect from using segment selectors and
non-lock-prefixed atomic instructions on the fast-path?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com