Date: Thu, 14 Dec 2017 18:12:57 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Chris Lameter <cl@linux.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Boqun Feng <boqun.feng@gmail.com>,
        Andy Lutomirski <luto@amacapital.net>,
        Dave Watson <davejwatson@fb.com>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        linux-api <linux-api@vger.kernel.org>, Paul Turner <pjt@google.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Russell King <linux@arm.linux.org.uk>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Andrew Hunter <ahh@google.com>,
        Andi Kleen <andi@firstfloor.org>, Ben Maurer <bmaurer@fb.com>,
        rostedt <rostedt@goodmis.org>, Josh Triplett <josh@joshtriplett.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Will Deacon <will.deacon@arm.com>,
        Michael Kerrisk <mtk.manpages@gmail.com>,
        Alexander Viro <viro@zeniv.linux.org.uk>
Message-ID: <12046460.34426.1513275177081.JavaMail.zimbra@efficios.com>
In-Reply-To: <alpine.DEB.2.20.1712141033100.6249@nuc-kabylake>
References: <20171214161403.30643-1-mathieu.desnoyers@efficios.com> <20171214161403.30643-3-mathieu.desnoyers@efficios.com> <alpine.DEB.2.20.1712141033100.6249@nuc-kabylake>
Subject: Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable
 sequences system call (v12)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Thread-Topic: rseq: Introduce restartable sequences system call (v12)
Thread-Index: dy+K5FN9TD5WdJx9V1/fjmLZC4kvWA==
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2690
Lines: 69

----- On Dec 14, 2017, at 11:44 AM, Chris Lameter cl@linux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
> 
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the cpu id cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
> 
> I think the proper way to think about gs and fs on x86 is as base
> registers. They are essentially values in registers added to the address
> generated in an instruction. As such the approach is transferable to other
> processor architectures. Many support base registers and base-register-
> relative addressing. If a processor can do RMW instructions base register
> relative then you have something similar.

How would you do it on ARM32?

> 
> In a restartable sequence you could increase efficiency by avoiding full
> atomic instructions. This would be similar to the lockless RMW available
> on x86 then. And in that form it is portable.
> 
> A context switch to another processor would mean that the value of the
> base register has changed and that we therefore are accessing another per
> cpu segment. Restarting the sequence will yield a correct result without
> any reloading of registers.

As a concrete example, let's try to apply your proposal on a common use-case:
a compare-and-store on user-space per-cpu data.

With my rseq proposal the fast-path pseudo-code boils down to:

load TLS::cpu_id_start into reg_X
add reg_X offset to base to find target v
store pointer to TLS::rseq_cs
compare reg_X against TLS::cpu_id
jne abort
cmp *v, value
jne cmpfail
store newval to *v
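
To make the control flow above easier to follow, here is a plain C
simulation of those steps. It is only a sketch: the real fast-path has to
be a single inline-asm block so the kernel can restart it on preemption,
which plain C cannot express. The names tls_rseq_sim,
percpu_cmpeqv_storev_sim and the CMPSTORE_* results are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the thread-local area shared with the kernel. */
struct tls_rseq_sim {
    uint32_t cpu_id_start; /* CPU number sampled at the start of the sequence */
    uint32_t cpu_id;       /* CPU number the kernel would keep up to date */
    const void *rseq_cs;   /* descriptor of the active sequence, or NULL */
};

enum cmpstore_result {
    CMPSTORE_OK = 0,      /* store was performed */
    CMPSTORE_ABORT = 1,   /* CPU changed: caller should retry */
    CMPSTORE_CMPFAIL = 2  /* *v did not hold the expected value */
};

/*
 * Simulation only: nothing here is restartable or safe against
 * preemption. It mirrors the pseudo-code step by step.
 */
static enum cmpstore_result
percpu_cmpeqv_storev_sim(struct tls_rseq_sim *tls, intptr_t *base,
                         size_t stride, intptr_t expect, intptr_t newval)
{
    static const char cs_descriptor; /* dummy critical-section descriptor */
    enum cmpstore_result ret;

    assert(tls != NULL && base != NULL);

    uint32_t cpu = tls->cpu_id_start;          /* load TLS::cpu_id_start */
    intptr_t *v = base + (size_t)cpu * stride; /* add offset to base */
    tls->rseq_cs = &cs_descriptor;             /* store pointer to TLS::rseq_cs */

    if (cpu != tls->cpu_id)                    /* compare against TLS::cpu_id */
        ret = CMPSTORE_ABORT;                  /* jne abort */
    else if (*v != expect)                     /* cmp *v, value */
        ret = CMPSTORE_CMPFAIL;                /* jne cmpfail */
    else {
        *v = newval;                           /* store newval to *v */
        ret = CMPSTORE_OK;
    }

    tls->rseq_cs = NULL; /* simulation only: mark that no sequence is active */
    return ret;
}
```

A caller would typically loop on CMPSTORE_ABORT, re-reading cpu_id_start
before each retry.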

My benchmark on Intel x86-64 E5-2630 shows that it takes 1.9 ns/iteration
for a test-case incrementing a counter with this rseq compare-and-store
sequence.

Let's assume we can reserve the gs segment selector for use in user-space,
and that the per-cpu data layout allows using this segment selector as offset.
The compare-and-store use-case would require a "cmpxchg" instruction with
a gs segment selector.

With a single-threaded test-case which uses non-lock-prefixed cmpxchg in a
loop on an E5-2630, I get 2.8 ns/iteration (no per-cpu data involved; done
on a single global value).
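
For reference, a test of this kind can be sketched roughly as follows.
This is not the benchmark used for the numbers above; it is a minimal
illustration that assumes GCC or Clang inline asm on x86-64, with a plain C
fallback elsewhere. The names cmpxchg_local_sketch and bench_increment are
made up for this example.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Compare-and-exchange without the lock prefix: atomic with respect to
 * the local CPU only. Returns 1 if *ptr held expect and was replaced
 * by newval, 0 otherwise.
 */
static inline int cmpxchg_local_sketch(uint64_t *ptr, uint64_t expect,
                                       uint64_t newval)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t prev = expect; /* cmpxchg compares rax with the memory operand */

    __asm__ __volatile__("cmpxchgq %2, %1"
                         : "+a"(prev), "+m"(*ptr)
                         : "r"(newval)
                         : "cc");
    return prev == expect;
#else
    /* Fallback for other targets: same result, no atomicity at all. */
    if (*ptr != expect)
        return 0;
    *ptr = newval;
    return 1;
#endif
}

/*
 * Increment a single global counter `iterations` times through the
 * helper above. Stores the measured ns/iteration in *ns_per_iter and
 * returns the final counter value.
 */
static uint64_t bench_increment(uint64_t iterations, double *ns_per_iter)
{
    static uint64_t counter; /* single global value, no per-cpu data */
    struct timespec t0, t1;

    assert(iterations > 0 && ns_per_iter != NULL);
    counter = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iterations; i++) {
        uint64_t old = counter;

        cmpxchg_local_sketch(&counter, old, old + 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    *ns_per_iter = ns / (double)iterations;
    return counter;
}
```

The measured figure depends heavily on compiler flags, loop overhead and
the machine, so it should not be expected to reproduce 2.8 ns/iteration.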

One benefit of your proposal is that it reduces the number of retired
instructions, but once IPC is taken into account, it is slower than rseq in
my benchmark (2.8 ns vs. 1.9 ns per iteration). What benefits do you expect
from using segment selectors and non-lock-prefixed atomic instructions on
the fast-path?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com